High Availability and Resilience – SOA-C02 Study Guide

This chapter covers the following official AWS Certified SysOps Administrator – Associate (SOA-C02) exam domains:

Domain 2: Reliability and Business Continuity

Domain 5: Networking and Content Delivery

(For more information on the official AWS Certified SysOps Administrator – Associate [SOA-C02] exam topics, see the Introduction.)

CramSaver

If you can correctly answer these questions before going through this section, save time by skimming the Exam Alerts in this section and then completing the Cram Quiz at the end of the section.

1. You operate a 99.9 percent HA application in an AWS region. You have received an SLA update for your application that raises the requirement from three nines to four nines. What would be the correct course of action in this scenario?

2. When is an application considered to be both highly available and resilient?

Answers

1. Answer: Establish another full replica of the application to increase the availability to four nines.

2. Answer: When at least two complete replicas of the application are deployed, and each replica can handle the failure of the other and accept 100 percent of the network traffic at any time.

An application can be made elastic and scalable if you can easily increase or decrease the capacity when required to meet the demand. However, high availability and resilience are not guaranteed even if an application is fully scalable and elastic. High availability and resilience depend on how you deploy and operate the application, and both of these factors need to be considered independently of scalability and elasticity.

High availability is expressed as the percentage of time an application is available, and resilience is the ability to maintain that availability when failures and errors occur in the application or the infrastructure. Availability levels are commonly referred to as “one nine” (90 percent), “two nines” (99 percent), “three nines” (99.9 percent), “four nines” (99.99 percent), and so on. Table 5.1 provides a breakdown of high availability definitions.

TABLE 5.1 Uptime Percentage Chart

Availability               Downtime per Day   Downtime per Month   Downtime per Year
90%, “one nine”            2.4 hours          72 hours             36.5 days
95%                        1.2 hours          36 hours             18.25 days
98%                        28.8 minutes       14.4 hours           7.30 days
99%, “two nines”           14.4 minutes       7.20 hours           3.65 days
99.5%                      7.2 minutes        3.60 hours           1.83 days
99.9%, “three nines”       1.44 minutes       43.2 minutes         8.76 hours
99.95%                     43.2 seconds       21.6 minutes         4.38 hours
99.99%, “four nines”       8.64 seconds       4.32 minutes         52.56 minutes
99.999%, “five nines”      0.86 seconds       25.9 seconds         5.26 minutes
99.9999%, “six nines”      0.086 seconds      2.59 seconds         31.5 seconds

Note: Monthly figures assume a 30-day month, and yearly figures assume a 365-day year.
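
The downtime figures for any availability level follow from simple arithmetic: multiply the unavailable fraction by the length of the period. The following is a minimal Python sketch of that calculation. The function name downtime_seconds is made up for this illustration, and the sketch assumes a 30-day month and a 365-day year.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(availability_percent, period_days):
    """Return the downtime, in seconds, allowed over a period of period_days."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY


# Assumption: a 30-day month and a 365-day year.
for pct in (99.0, 99.9, 99.99, 99.999):
    per_day = downtime_seconds(pct, 1)
    per_month = downtime_seconds(pct, 30)
    per_year = downtime_seconds(pct, 365)
    print(
        f"{pct}%: {per_day:.2f} s/day, "
        f"{per_month / 60:.2f} min/month, "
        f"{per_year / 3600:.2f} h/year"
    )
```

Changing the period lengths (for example, using an average month of about 30.4 days) shifts the monthly figures slightly, which is why published charts do not always agree to the last digit.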

As you can see, the chart could extend to seven nines, eight nines, and so on; however, it becomes impractical to measure downtime beyond five nines in a cloud-connected application. The application is usually reached over the Internet, where typical response times can run to tens of milliseconds, and at six nines the permitted downtime is only about 86 milliseconds per day, which is on the same order as a single response. In most cases, a five nines application is considered to essentially be available “all the time.”

When defining a service-level agreement (SLA), you typically need to define an uptime that your application can theoretically deliver. But what is the projected uptime for a complex, multilayer application? The easiest way to define it is to take the component with the lowest SLA and apply that SLA to the whole application. However, if that uptime falls below the threshold the SLA requires, you need to establish another replica of the application and combine the uptimes of both to deliver the new uptime.
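
The weakest-component approach described above can be sketched in a few lines of Python. The layer names and SLA percentages here are hypothetical values chosen for illustration only; they are not published figures for any AWS service.

```python
def weakest_component(component_slas):
    """Return (name, sla_percent) for the component with the lowest SLA."""
    name = min(component_slas, key=component_slas.get)
    return name, component_slas[name]


# Hypothetical layers and SLA percentages, for illustration only.
example_slas = {
    "web tier": 99.99,
    "application tier": 99.95,
    "database tier": 99.9,
}

name, sla = weakest_component(example_slas)
print(f"Application SLA estimate: {sla}% (limited by the {name})")
```

If the estimate printed here is lower than the uptime the SLA must promise, that is the signal to add a second replica.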

For example, an application has a 99.9 percent uptime. To increase the overall uptime, you need to spin up another replica of the application in a different availability zone (AZ) and divide the traffic between the two. This second copy also has 99.9 percent uptime. Assuming the two replicas fail independently, the combined uptime is 100 percent minus the product of the individual failure rates.

In this example, the uptime would be defined as

Uptime = 100% − (0.1% × 0.1%) = 100% − 0.0001% = 99.9999%

In theory, then, two three-nines replicas exceed a four-nines requirement with room to spare. In practice, replicas often share dependencies and rarely fail completely independently, so treating the combined result as four nines is the safer commitment. Either figure holds only if each replica is capable of receiving the full 100 percent of the network traffic at any time and if requests can be handled and distributed across both replicas without loss, meaning the replicas are also designed with full resilience.
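
The combined-uptime calculation can be sketched as follows. This is a minimal Python illustration, not an official formula from AWS: the function name combined_uptime_percent is hypothetical, and the sketch assumes that the replicas fail independently of one another and that any single replica can carry all of the traffic.

```python
def combined_uptime_percent(replica_uptimes_percent):
    """Combine the uptimes of redundant replicas.

    Assumption: replicas fail independently, so the application is down
    only when every replica is down at the same time.
    """
    combined_failure = 1.0
    for uptime in replica_uptimes_percent:
        combined_failure *= (100.0 - uptime) / 100.0
    return 100.0 * (1.0 - combined_failure)


two_replicas = combined_uptime_percent([99.9, 99.9])
print(f"Two 99.9% replicas: {two_replicas:.4f}%")
print(f"Meets four nines: {two_replicas >= 99.99}")
```

For two 99.9 percent replicas, the result meets the four-nines target. Passing a longer list shows how each additional independent replica shrinks the combined failure rate further.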