This chapter covers the following official AWS Certified SysOps Administrator – Associate (SOA-C02) exam domains:
Domain 2: Reliability and Business Continuity
Domain 5: Networking and Content Delivery
(For more information on the official AWS Certified SysOps Administrator – Associate [SOA-C02] exam topics, see the Introduction.)
If you can correctly answer these questions before going through this section, save time by skimming the Exam Alerts in this section and then completing the Cram Quiz at the end of the section.
1. You operate an application with 99.9 percent availability in an AWS region. An SLA update has raised the availability requirement from three nines to four nines. What would be the correct course of action in this scenario?
2. When is an application considered to be both highly available and resilient?
1. Answer: Establish another full replica of the application to increase the availability to four nines.
2. Answer: The application requires at least two complete replicas to be deployed. Each replica must be able to handle the failure of the other replica and accept 100 percent of the network traffic at all times.
An application can be made elastic and scalable if you can easily increase or decrease the capacity when required to meet the demand. However, high availability and resilience are not guaranteed even if an application is fully scalable and elastic. High availability and resilience depend on how you deploy and operate the application, and both of these factors need to be considered independently of scalability and elasticity.
Availability is the percentage of time that an application is able to serve requests, and resilience is the ability to maintain that availability in case of failures and errors in the application or the infrastructure. High availability is commonly expressed in “nines”: “one nine” means 90 percent, “two nines” means 99 percent, “three nines” means 99.9 percent, “four nines” means 99.99 percent, and so on. Table 5.1 provides a breakdown of high availability definitions.
TABLE 5.1 Uptime Percentage Chart
Availability | Downtime Per Day | Downtime Per Month | Downtime Per Year |
90%, “one nine” | 2.4 Hours | 72 Hours | 36.5 Days |
95% | 1.2 Hours | 36 Hours | 18.25 Days |
98% | 28.8 Minutes | 14.4 Hours | 7.30 Days |
99%, “two nines” | 14.4 Minutes | 7.20 Hours | 3.65 Days |
99.5% | 7.2 Minutes | 3.60 Hours | 1.83 Days |
99.9%, “three nines” | 1.44 Minutes | 43.2 Minutes | 8.76 Hours |
99.95% | 43.2 Seconds | 21.6 Minutes | 4.38 Hours |
99.99%, “four nines” | 8.64 Seconds | 4.32 Minutes | 52.56 Minutes |
99.999%, “five nines” | 0.86 Seconds | 25.9 Seconds | 5.26 Minutes |
99.9999%, “six nines” | 0.086 Seconds | 2.59 Seconds | 31.5 Seconds |
As you can see, the chart could extend to seven nines, eight nines, and so on; however, in practice it becomes impractical to measure downtime beyond five nines in a cloud-connected application. The application is usually reached over the Internet, where typical response times can run to tens of milliseconds, so outages this short are hard to distinguish from ordinary network latency. In most cases, a five-nines application is considered to be available essentially “all the time.”
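The downtime figures in an uptime chart are simple arithmetic: multiply the unavailable fraction by the length of the period. The following is a minimal sketch, assuming a 30-day month and a 365-day year; the function name `downtime_seconds` is a hypothetical helper chosen for illustration.

```python
def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Return the allowed downtime, in seconds, over a period of the given length."""
    # Fraction of the period during which the application may be unavailable.
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return unavailable_fraction * period_days * 24 * 60 * 60


# Assumption: a month is 30 days and a year is 365 days.
for pct in (99.0, 99.9, 99.99, 99.999):
    per_day = downtime_seconds(pct, 1)
    per_month = downtime_seconds(pct, 30)
    per_year = downtime_seconds(pct, 365)
    print(
        f"{pct}%: {per_day:.2f} s/day, "
        f"{per_month / 60:.2f} min/month, "
        f"{per_year / 3600:.2f} h/year"
    )
```

Changing the period lengths (for example, using an average month of 365/12 days) shifts the monthly figures slightly, which is why published uptime charts do not always agree to the last digit.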
When defining a service-level agreement (SLA), you typically need to define an uptime that your application will theoretically be able to deliver. But what is the projected uptime for a complex, multilayer application? The easiest way to define it is to take the weakest component with the lowest SLA and define the whole application according to that SLA. However, if that uptime falls below the threshold required by the SLA, you need to establish another replica of the application and combine the uptimes of both to deliver the new uptime.
For example, an application has a 99.9 percent uptime. To increase the overall uptime, you need to spin up another replica of the application in a different availability zone (AZ) and divide the traffic between them. This second copy also has 99.9 percent uptime. Assuming the two replicas fail independently, the combined uptime is 100 percent minus the product of the failure rates.
In this example, the uptime would be defined as
Uptime = 100% − (0.1% × 0.1%) = 100% − 0.0001% = 99.9999%
In theory, then, the combined uptime of two three-nines application replicas exceeds a four-nines requirement. Real failures are rarely fully independent, so treating the pair as a four-nines deployment is the conservative reading. Either figure holds only if each replica is capable of receiving the full 100 percent of the network traffic at any time and requests can be redistributed between the replicas without loss, meaning the replicas are also designed with full resilience.
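The two rules of thumb in this section, taking the weakest component as the uptime of a layered application and multiplying failure rates for parallel replicas, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the replica calculation assumes that failures are statistically independent and that any single replica can carry all of the traffic.

```python
from math import prod


def weakest_component(component_pcts):
    """Rule of thumb: a layered application is only as available as its weakest layer."""
    return min(component_pcts)


def combined_availability(replica_pcts):
    """Availability (percent) of parallel replicas.

    Assumptions: replicas fail independently, and any one surviving
    replica can serve 100 percent of the traffic.
    """
    # The application is down only when every replica is down at the same time,
    # so the combined failure fraction is the product of the individual ones.
    failure_fractions = [(100.0 - pct) / 100.0 for pct in replica_pcts]
    return 100.0 * (1.0 - prod(failure_fractions))


# Example: a three-layer application, then two replicas of it in different AZs.
single = weakest_component([99.99, 99.9, 99.95])
print(f"single replica: {single}%")
print(f"two replicas:   {combined_availability([single, single]):.4f}%")
```

Correlated failures, such as a shared dependency that both replicas rely on, break the independence assumption and lower the real-world figure, so the calculated value is best read as an upper bound.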