- Posted May 5, 2014 by
Calgary Disaster Averted: Historic Flood Required Emergency Switch to MPLS Network
It was either divine intervention or fortuitous chance that long before the historic floods struck Calgary in the summer of 2013, city officials had been planning an extensive upgrade of local area networking and web-based communications that involved migrating all sites to a Multiprotocol Label Switching (MPLS) network. The extraordinarily lucky planning and preparation for this high-tech evolution in city systems gave the city a chance to save its networks when catastrophe struck. What the city planners never imagined, of course, was that in order to save the whole system, they would have to accomplish in mere days what was supposed to take years.
With a population of well over a million residents, Calgary is the largest city in Alberta, a province in western Canada. The city is located not far north of the Montana border, and is perhaps most famous for hosting the Winter Olympics in 1988. Calgary's Olympic experience is relevant here, because hosting the Games is an enormously complex exercise in city planning and infrastructure design, as well as preparation for emergency management. It is precisely those skills and that experience which may have allowed Calgary to avoid what might have been one of the worst tech disasters in modern history.
The fundamental problem - and the issue that created the disaster in the first place - was that the original designers of Calgary's network systems created a single point of weakness as vulnerable as the exhaust port in the Death Star in the original Star Wars movie. In simple terms, when the city's point-to-point network was designed, fully half of the sites were connected via fiber optic cables to a single node within Calgary's City Hall building. If anything disastrous were to befall the City Hall building, the network would go down hard - and perhaps permanently. In 2014, no network designer would permit such a singular vulnerability. The terrorist attacks of September 11th, 2001, forever changed design thinking, which had often focused on centralization and convenience of network maintenance. After 9/11, it was understood that the risk of putting all of the technological eggs in one basket was just too high.
But Calgary's system was designed before that date changed everything. Unfortunately, there were other problems inherent in the design, most notably the fact that Calgary City Hall is housed in an old government building that rests on a flood plain. After reviews were conducted of city infrastructure in the early 2000s, these weaknesses became clear and formed the impetus for decisions to eventually migrate the sites to an MPLS network. And it was critical to make the move fast. As Gary at BT IP Connect says:
"MPLS-based services improve disaster recovery in a variety of ways. First and foremost data center and other key sites can be connected in multiply redundant ways to the cloud (and thus to other sites on the network). Secondly, remote sites can quickly and easily reconnect to backup locations if needed (unlike with ATM and frame networks, in which either switched or backup permanent-virtual-circuits are required)."
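The contrast between a single hub and redundantly connected sites can be illustrated with a small connectivity check. This is only a minimal sketch under stated assumptions: the site names, the `pe_1`/`pe_2` provider nodes, and the link lists are hypothetical and do not describe Calgary's actual topology, and the code models plain reachability, not MPLS label switching itself.

```python
from collections import deque

def reachable(links, start, failed=frozenset()):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    if start in failed:
        return set()
    # Build an undirected adjacency map from the list of links.
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen and neighbor not in failed:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def cut_off_sites(links, sites, origin, failed_node):
    """List the sites that lose their path to origin when failed_node goes down."""
    alive = reachable(links, origin, {failed_node})
    return sorted(s for s in sites if s != failed_node and s not in alive)

# Hypothetical site names, for illustration only.
sites = ["site_a", "site_b", "site_c"]

# Hypothetical hub layout: two of the three sites reach the data center
# only through a single node at city_hall.
hub_links = [
    ("data_center", "city_hall"),
    ("city_hall", "site_a"),
    ("city_hall", "site_b"),
    ("data_center", "site_c"),
]

# Hypothetical redundant layout: every site and the data center attach to
# two provider nodes (pe_1, pe_2), so no single node carries all traffic.
redundant_links = [
    ("data_center", "pe_1"), ("data_center", "pe_2"),
    ("site_a", "pe_1"), ("site_a", "pe_2"),
    ("site_b", "pe_1"), ("site_b", "pe_2"),
    ("site_c", "pe_1"), ("site_c", "pe_2"),
]

print(cut_off_sites(hub_links, sites, "data_center", "city_hall"))
print(cut_off_sites(redundant_links, sites, "data_center", "pe_1"))
```

In the hub layout, losing the `city_hall` node strands every site that depends on it, while in the redundant layout the loss of one provider node strands none.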
In fact, a year before the 2013 flood, the city had already completed a move of its critical data center to a new location far from City Hall - and not in a vulnerable flood plain. Without that first decision, which amounted to the first in a chain of very lucky circumstances, the network would have been doomed once the rains began to fall.
So the stage is set in early summer 2013. The data center had been moved to a new location, and the MPLS network was in place. At that point, project managers still expected the migration of data and sites to the new network to take at least 12-18 months. There was no rush, after all - plenty of time to carefully test the new network before migration commenced. Plans were already in place to begin that testing, in order to ensure no loss of data once the migration began.
And then in early June 2013, the rain began to fall. In a semi-arid part of Canada not used to significant rainfall, an enormous deluge fell from the sky. For two weeks, it rained almost unceasingly, at one point dropping 13 inches of rain on Calgary in less than two days. The lakes rose and the rivers crested their banks, and the city flooded to once-in-a-century levels.
By the 18th of June, it became unmistakably clear that much of the city would soon be underwater, including the old City Hall building. By the time city officials notified project managers that the flooding of the government building was imminent, they had a matter of hours to save the network. The technical engineers running the show realized to their horror that if the building flooded, the network was lost. And they had no more than 3 hours to find a way to prevent the disaster.
There was only one possible answer, of course, and the answer only existed because fate had already intervened. The transfer of the data center the summer before gave planners some hope; if that had not already been completed, all would have been lost. But what about the extensive testing planned prior to migration to the MPLS network?
Engineers and officials realized there was no time and nothing to lose. If they waited, the system was lost. There was no time to test the network, and no time for carefully designed phased migration. The decision was made: do the whole thing right now and today, with no testing and no phased-in migration. In short, the project managers were told: you have 3 hours to accomplish a year-long project.
With no other choice, the project manager gave the order. Miraculously, for the most part it succeeded. Some data was lost, but within a couple of days the vast majority of the sites were up and running. The MPLS network migration worked, thanks largely to careful planning and design, and to the good fortune of having the right pieces in the right places when disaster happened.
A year later, city planners are still working to refine the network and fix minor damage, but for the most part they pulled off the impossible. And they've learned some important lessons in the fatal weakness of network centralization - especially when the hub is on a flood plain.