What Your Organization’s Cloud Can Learn from AT&T’s Outage
The week of Feb. 22 was not a great one for AT&T, and that is by the company's own admission, as seen in a letter from AT&T CEO John Stankey that was released to the press on Sunday, Feb. 25. In what many classified as a “major service interruption,” millions of Americans realized, as the United States time zones “woke up” for the day, that their wireless devices were unable to access AT&T's nationwide network. According to CNN, service restoration had begun to progress by midday, and over the next several hours the AT&T wireless network stabilized, with systems working “as expected” before the end of the business day.
How a cell network outage took out cloud services
Why is this important?
We need to back up a bit to gain a little perspective. The internet is a series of globally interconnected telecommunications networks, connected to other globally interconnected telecommunications networks, allowing network traffic access to hosts and services that reside on those interconnected networks. The “network of networks” topology is foundational to the ISP 3-Tier Service Model, which defines the different types of telecommunications providers, globally.
AT&T, as an Internet Service Provider (ISP), is classified as Tier 1. Tier 1 ISPs are more often referred to as “backbone” carriers. Together, the Tier 1 ISPs and their interconnected telecommunications networks form what most consider “the backbone of global internet infrastructure.” In other words, AT&T is among the “top tier” ISPs in the United States, and the knowledge, infrastructure, leadership, and resources it has access to as a global telecommunications provider affirm that standing.
Yet, from a customer perspective, this “Tier 1 global telecommunications superpower” is still a single point of failure.
Expert vs. Novice
A novice has not yet reached the level of proficiency needed to be aware of the limits of their knowledge. An expert understands that expertise is not a destination; it is a practice, a point always on the horizon but never quite under their own feet.
What can we learn from this failure?
What can we do to apply that knowledge within our own organizations? That is the question we should all be asking right now. Not “how can we prevent an outage like this in the future?” but rather, “how can we architect and implement solutions that are resilient, or graceful, in the face of failure?”
There will always be outages that we cannot control. Dependencies will break in ways that no one could have conceived of ahead of time. Failure is inevitable. Use this moment to learn from someone else’s failure. Test your architecture, debate different approaches with colleagues, and challenge assumptions that seem foundational. The one gift that failure will always give us is the opportunity to learn, to accept feedback from a new source, and to take in the latest information and make it relevant in our own work.
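As one small illustration of what "graceful in the face of failure" can look like in code, here is a minimal sketch of provider failover, assuming a service can reach the same capability through more than one carrier or provider. Every name in it (`call_with_failover`, `primary_carrier`, `secondary_carrier`, `AllProvidersFailed`) is hypothetical and chosen for illustration; it is not drawn from AT&T or any real library.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]],
) -> Tuple[str, str]:
    """Try each provider in order and return (name, result) from the first success."""
    errors: List[str] = []
    for name, send in providers:
        try:
            return name, send()
        except ConnectionError as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two independent network paths.
def primary_carrier() -> str:
    raise ConnectionError("network unreachable")


def secondary_carrier() -> str:
    return "delivered"


if __name__ == "__main__":
    used, result = call_with_failover(
        [("primary", primary_carrier), ("secondary", secondary_carrier)]
    )
    print(f"{result} via {used}")
```

The point of the sketch is the shape, not the details: the caller never assumes a single path will succeed, and when every path fails it gets one clear error that names each failure, which is the kind of behavior worth exercising when you test your architecture.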
Is AT&T a great carrier? Yes, they are. Is AT&T infallible? No, they are not. No product, carrier, platform, design, solution, system, network, or component is now, or ever will be, infallible. As organizations and leading technologists, that is a risk we assume when we build.
The experience, knowledge, and trust that we cultivate within the teams that build and maintain our systems are within our control, however. Should we only hire experts? No. Build a team of Senior Software Engineers, credentialed System and Solution Architects, and other senior staff who can show your novice staff not only how to build for resiliency, but also how to recover from unexpected failure. Not sure how to build an interdependent team that can set up successful pathways to learn from each other? Find a partner that can (Psst, that’s us!).