High Availability and Fault Tolerance in the Public Cloud

Author: George Burns Posted In: AWS, Cloud

Jan 15, 2021


Deploying applications into the public cloud can have incredible upsides: switching application investment from CapEx to OpEx, and retiring data centers and colocation space in favor of the elastic infrastructure that cloud platforms provide. Plus, there’s the potential to globally scale any application or service across the service fabric of your choice.

Sounds great, doesn’t it? It is! Or at least it can be. Let’s unpack the public cloud.

What does that mean?

What is the first thing we are taught when we come into IT? Everything will break. Everything. Avoid single points of failure. Back everything up. Never deploy anything without defined recovery plans. Assume that power will fail, that your internet connectivity will experience downtime, and that your code will run into something it cannot handle.

How can we do that when we do not have control over the physical environment? How can we protect against failures in the service fabric of a public cloud provider when we do not even know all the platform components, or how they interact?

When the cloud fails

Recently, we had a taste of just what can happen when a portion of the public cloud goes down: AWS’s Kinesis Data Streams service in the US-East-1 region went down for almost 18 hours. The trouble was not limited to Kinesis; it spread to many other AWS services in the same region that depend on Kinesis in some form or another.

I am not here to analyze the outage or dive into the post-mortem, but rather to reinforce a core principle in system architecture: do not put all your eggs in one basket. Do not deploy your mission-critical resources to one AWS region or, even worse, a single availability zone. Why? Think back to how we would design data centers a decade ago: multiple independent subsystems (connectivity, power, cooling, etc.), but also multiple independent deployments.

We did not put everything in a single data center, slap a bow on it and call it a day. Instead, we built “DR Sites,” which we could roll over to in case of a data center outage. We knew then that data centers experience outages and that we needed a backstop to protect our operations when one occurred. Yet, many of us do not take the same precautions with our cloud deployments.

What is the answer?

There is no one perfect answer. There is no “one size fits all” solution that can be applied to every deployment that will magically protect your infrastructure and applications. Every environment must be properly architected to provide continuity in the event of an outage. Multi-region workload distribution and hybrid cloud implementations are two of the more obvious solutions. However, just because these options are obvious does not mean that they are right for everyone.

Architecting for reliability

In designing for reliability, there are a few requirements for achieving the desired state of availability. This becomes increasingly important when decoupling applications, because each decoupled layer must be designed for reliability.

For example, if your website runs separate environments for search and shopping cart, then both environments must be independently designed for reliability. Monolithic applications, by contrast, must scale the entire application to meet the highest level of availability required by any single component. If your application contains both search and shopping cart, and the shopping cart requires a higher level of availability than search, the entire application must be scaled up and out to achieve the shopping cart’s required reliability, which will most likely over-provision resources for search.
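To put rough numbers on the over-provisioning argument, here is a minimal sketch. Everything in it is an assumption made for illustration: the per-instance availability, the two targets, the `replicas_needed` helper, and the simplification that replicas fail independently (real deployments only approximate this).

```python
def replicas_needed(instance_availability: float, target: float) -> int:
    """Smallest replica count n with 1 - (1 - a)^n >= target.

    Assumes replicas fail independently, which is a simplification.
    """
    if not 0 < instance_availability < 1 or not 0 < target < 1:
        raise ValueError("availabilities must be strictly between 0 and 1")
    n = 1
    # Add replicas until the chance that all of them are down fits the target.
    while (1 - instance_availability) ** n > 1 - target:
        n += 1
    return n


# Hypothetical figures, chosen only for illustration.
INSTANCE_AVAILABILITY = 0.95
SEARCH_TARGET = 0.99
CART_TARGET = 0.9999

search_replicas = replicas_needed(INSTANCE_AVAILABILITY, SEARCH_TARGET)
cart_replicas = replicas_needed(INSTANCE_AVAILABILITY, CART_TARGET)

# Decoupled: each component is scaled to its own target.
decoupled_units = search_replicas + cart_replicas

# Monolith: every instance carries both components, and the instance
# count is driven by the strictest target.
monolith_instances = max(search_replicas, cart_replicas)
monolith_units = monolith_instances * 2

print(f"search replicas: {search_replicas}, cart replicas: {cart_replicas}")
print(f"component copies, decoupled: {decoupled_units}, monolith: {monolith_units}")
```

With these made-up figures, the monolith ends up running more copies of search than search alone would need, which is the over-provisioning described above.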

To achieve high availability, a platform must have 2 separate environments (or N+1, where N is the number of required environments), each behind a stateful load balancer. If one environment is offline, the other can process requests and the user is none the wiser.

For a deployment to be considered fault tolerant, the same platform must have 3 separate environments (or N+2), with each environment located behind a stateful load balancer. However, fault tolerance has an additional requirement in that these environments must be running in at least 2 physically separate locations.
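The two definitions above can be written down as a small check. This is only a sketch of the rules as stated in this post: the `Environment` type, the `classify` function, and the location labels are hypothetical names, and the load balancer requirement is not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    location: str  # hypothetical label for a physical location (region, site)


def classify(environments: list[Environment], required: int = 1) -> str:
    """Classify a deployment using the N+1 / N+2 rules described above.

    `required` is N, the number of environments needed to serve the load.
    """
    count = len(environments)
    locations = {env.location for env in environments}
    # Fault tolerant: N+2 environments spread over at least 2 locations.
    if count >= required + 2 and len(locations) >= 2:
        return "fault tolerant"
    # Highly available: at least N+1 environments.
    if count >= required + 1:
        return "highly available"
    return "no redundancy"


deployment = [
    Environment("web-a", "site-1"),
    Environment("web-b", "site-1"),
    Environment("web-c", "site-2"),
]
print(classify(deployment))
```

Note that three environments in a single location would only count as highly available under these rules, because the physical separation requirement is not met.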

Reliability and the cloud

Simply put, each environment needs defined reliability metrics, an infrastructure that can heal itself (or be healed with a defined intervention plan), and a disaster recovery plan for when the deployment goes off the rails or becomes unavailable. Business continuity and disaster planning are as important today as they have ever been. When working with a service fabric that is owned and operated by a third party, disaster planning needs to encompass the potential for mass service degradation, caused by forces outside of your control, that negatively impacts your environment.
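One way to make "defined reliability metrics" concrete is to translate an availability target into a yearly downtime budget. The sketch below is illustrative only: the 99.9% target and the function names are assumptions, and the 18-hour figure is simply the approximate length of the Kinesis outage mentioned earlier.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years for simplicity


def downtime_budget_hours(target: float) -> float:
    """Hours of downtime per year that an availability target allows."""
    return (1 - target) * HOURS_PER_YEAR


def achieved_availability(downtime_hours: float) -> float:
    """Availability over one year given the total hours of downtime."""
    return 1 - downtime_hours / HOURS_PER_YEAR


# Hypothetical target of 99.9% ("three nines").
target = 0.999
budget = downtime_budget_hours(target)

# A single outage of roughly 18 hours, with no failover in place.
outage_hours = 18
within_budget = outage_hours <= budget

print(f"budget at {target:.1%}: {budget:.2f} hours per year")
print(f"availability after an {outage_hours}-hour outage: "
      f"{achieved_availability(outage_hours):.4%}")
print(f"within budget: {within_budget}")
```

Under these assumptions, a single-region deployment that rode out an outage of that length would miss a three-nines target for the whole year, which is why the metrics need to be defined before the outage rather than after it.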

When evaluating the maturity of your cloud deployments, factor in an overarching disaster recovery plan, even if some of the situations you plan for seem far-fetched. Document your planning efforts and test them OFTEN! Every time code is changed, a platform is modified, or a dependency is migrated, the potential for unexpected behavior is introduced, as is the need for deployment hardening. The more you plan now, the lower the chances of phone calls or alerts waking you up in the middle of the night.