
What your Organization’s Cloud can Learn from AT&T’s Outage 

Author: George Burns Posted In: Cloud

The week of Feb. 22 was not a great one for AT&T, and that is by the company’s own admission, as seen in a letter from AT&T CEO John Stankey that was released to the press on Sunday, Feb. 25. As the United States time zones “woke up” for the day, millions of Americans realized that their wireless devices could not access AT&T’s nationwide network, in what many classified as a “major service interruption.” According to CNN, service restoration was under way by midday, and over the next several hours the AT&T wireless network stabilized, with systems working “as expected” before the end of the business day.

How a cell network outage took out cloud services 

On its own, the connection may be hard to see. But the AT&T wireless network outage also made for a rough day for cloud services, especially those that use, or depend on, the AT&T wireless network as part of their network or product strategy.

Twilio users, for example, were unable to send text messages in the United States, which leads one to assume several things:

- Twilio’s network uses AT&T SMS/MMS services to satisfy client API requests.
- Twilio’s ability to redirect traffic around the problem may have been limited, as its recovery appears to have followed the recovery of the AT&T network.
- Consumers of Twilio’s SMS/MMS APIs in the United States experienced similar service outages, affecting business systems around the world.

The point here is not to single out Twilio for how the outage affected its users, so much as to show that the AT&T wireless network outage had impacts beyond the scope of its own network. Think about integrations where SMS messages are used as the sole or an additional factor in 2FA (two-factor authentication) workflows. What happens to your users then? How can they authenticate if their only additional security factor is SMS, and that just went “poof”?
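One way to soften the SMS-only 2FA scenario is to give each user more than one way to receive a code and to try those channels in order. The sketch below is only an illustration of that idea, not a description of how Twilio or any other provider works: the `Channel` type, the `deliver_code` function, and the two sender functions are hypothetical names, and the SMS sender simply simulates a carrier outage by raising an error.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class DeliveryError(Exception):
    """Raised by a channel when it cannot deliver the code."""


@dataclass
class Channel:
    name: str
    # Takes (user_id, code); raises DeliveryError if delivery fails.
    send: Callable[[str, str], None]


def deliver_code(user_id: str, code: str, channels: List[Channel]) -> Optional[str]:
    """Try each channel in order.

    Returns the name of the first channel that succeeds,
    or None if every channel fails.
    """
    for channel in channels:
        try:
            channel.send(user_id, code)
            return channel.name
        except DeliveryError:
            # This channel is down; fall through to the next one.
            continue
    return None


# Hypothetical senders, for illustration only.
def send_sms(user_id: str, code: str) -> None:
    # Simulates the carrier network being unavailable.
    raise DeliveryError("carrier network unavailable")


def send_email(user_id: str, code: str) -> None:
    # Pretend the email provider accepted the message.
    return None


channels = [Channel("sms", send_sms), Channel("email", send_email)]
used = deliver_code("user-123", "493817", channels)
print(used)  # email
```

If every channel fails, `deliver_code` returns `None`, and the caller still has to decide what the user sees at that point. The ordering of channels is a design choice: it assumes the fallback factor was enrolled before the outage, which is the part that has to happen ahead of time.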

Why is this important?

We need to back up a bit to gain a little perspective. The internet is a series of globally interconnected telecommunications networks, connected to other globally interconnected telecommunications networks, allowing network traffic access to hosts and services that reside on those interconnected networks. The “network of networks” topology is foundational to the ISP 3-Tier Service Model, which defines the different types of telecommunications providers, globally. 

AT&T, as an Internet Service Provider (ISP), is classified as Tier 1; such providers are more often referred to as “backbone” carriers. A collection of Tier 1 ISPs and their interconnected telecommunications networks form what most consider “the backbone of global internet infrastructure.” In other words, AT&T is among the “top tier” ISPs in the United States, and the knowledge, infrastructure, leadership, and resources it has access to as a global telecommunications provider affirm that standing.

Yet, from a customer perspective, this “Tier 1 global telecommunications superpower” is still a single point of failure.


Expert vs. Novice

A novice has not yet reached the level of proficiency needed to be aware of the limits of their knowledge. An expert understands that expertise is not a destination; it is a practice, a point always on the horizon but never quite under their own feet.

What can we learn from this failure?

What can we do to apply that knowledge within our own organizations? That is the question we should all be asking right now. Not “how can we prevent an outage like this in the future?” but rather, “how can we architect and implement solutions that are resilient, or graceful, in the face of failure?”

There will always be outages that we cannot control. Dependencies will break in ways that no one could have conceived of ahead of time. Failure is inevitable. Use this moment to learn from someone else’s failure. Test your architecture, debate different approaches with colleagues, and challenge assumptions that seem foundational. The one gift that failure will always give us is the opportunity to learn, to accept feedback from a new source, and to take in the latest information and make it relevant to our own work.

Is AT&T a great carrier? Yes, they are. Is AT&T infallible? No, they are not. No product, carrier, platform, design, solution, system, network, or component is now, or ever will be, infallible. As organizations and leading technologists, that is a risk we assume when we build.

The experience, knowledge, and trust that we cultivate with the teams that build and maintain our systems are within our control, however. Should we only hire experts? No. Build a team of Senior Software Engineers, credentialed System and Solution Architects, and other Seniors to show your novice staff not only how to build for resiliency, but also how to recover from unexpected failure. Not sure how to build an interdependent team that can set up successful pathways to learn from each other? Find a partner that can (Psst, that’s us!).