Automating Recovery for Components Constrained to a Single Location

In some cases, it will not be possible to run components of the workload across multiple AZs. For example, all nodes of an Amazon EMR cluster are launched in the same AZ, because the reduced network latency between nodes improves data access rates and thus job performance. If a component that is constrained to a single AZ is essential to your workload's resilience, you need to set up a mechanism that automatically redeploys that component in another AZ whenever needed.

Whenever technological constraints prevent you from deploying the workload across multiple AZs, you must determine an alternate path to resilience. Remember to also automate the recreation of the necessary infrastructure, the redeployment of the application, and the restoration of its data accordingly.
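As an illustration, the following is a minimal sketch of such a recovery mechanism, assuming a Lambda-style Python function (using boto3) that is triggered when the AZ hosting the primary EMR cluster becomes impaired. The cluster name, release label, subnet IDs, and IAM role names are hypothetical placeholders, and a real implementation would also need to restore or re-point the cluster's data, for example from Amazon S3.

import boto3

emr = boto3.client("emr")

# Hypothetical subnets, one per AZ, listed in order of preference.
FALLBACK_SUBNETS = ["subnet-0aaa1111", "subnet-0bbb2222"]

def redeploy_cluster(impaired_subnet_id: str) -> str:
    """Recreate the single-AZ EMR cluster in the next healthy AZ."""
    # Pick the first fallback subnet that is not in the impaired AZ.
    target_subnet = next(s for s in FALLBACK_SUBNETS if s != impaired_subnet_id)

    response = emr.run_job_flow(
        Name="analytics-cluster-recovery",      # hypothetical cluster name
        ReleaseLabel="emr-6.15.0",               # assumed EMR release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2SubnetId": target_subnet,        # places all nodes in the new AZ
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]

Such a function could be invoked, for instance, by an Amazon EventBridge rule reacting to a health or alarm event, so the redeployment happens without manual intervention.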

Summary

This chapter covered how you can leverage reliability design principles to design highly available workloads. Environmental constraints, such as service and account quotas and network topology, were considered first. You then learned how to design workloads to prevent, mitigate, or withstand failure in a distributed environment. You also explored monitoring by leveraging logs and metrics. Finally, you reviewed how to handle failure by leveraging data backups and multiple AZs or Regions, and how to test for reliability.

The next chapter will discuss business continuity aspects in detail.

7 Ensuring Business Continuity

This chapter will focus on determining a solution design to ensure business continuity. You will look at the different strategies to protect your critical, and less critical, workloads on AWS in case of a disaster.

You will also learn how to design solutions that protect against a disaster and allow you to recover from it, which is paramount to making sure that your business can continue operating. This chapter will guide you through the various possible approaches depending on your business continuity needs.

The chapter covers the following main topics:

  • Disaster recovery versus high availability
  • Establishing a business continuity plan
  • Disaster recovery options on AWS
  • Detecting and testing disaster recovery

Disaster Recovery versus High Availability

Start with the essential definitions for this discussion. A disaster refers to a large-scale event that impacts a broad geographical area. In AWS terms, a disaster may impair an Availability Zone (AZ), multiple AZs, an entire AWS Region, or, worse, several Regions.

Disaster recovery (DR) is the process that tackles both the prevention of a disaster and the recovery from a disaster.

High availability (HA) addresses how a workload can keep functioning even though some of its components are impacted by a failure.

How do the two compare with each other? Simply put, HA deals with local failures while DR deals with large-scale failures, so they complement each other.

That said, designing for HA on AWS often brings some form of DR protection at the same time. Why is that? Because AWS reliability best practices (see Chapter 6, Meeting Reliability Requirements, for more details) recommend measures that also provide some level of protection in the event of a disaster. For instance, following AWS best practices, you may have decided to increase the resilience of your workload by distributing it across two or more AZs within a given Region. If your workload serves customers in multiple geographies globally, you may even have designed it to run independently in two or more Regions. In both cases, your design already protects you against some form of disaster – the former against the failure of one AZ, the latter against the failure of an entire Region.
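As a simple illustration of the multi-AZ case, a compute tier can be deployed as an Auto Scaling group that spans subnets in different AZs. The following is a minimal boto3 sketch; the group name, launch template, and subnet IDs are hypothetical placeholders.

import boto3

autoscaling = boto3.client("autoscaling")

# Subnets located in two different AZs of the same Region (hypothetical IDs).
MULTI_AZ_SUBNETS = "subnet-0aaa1111,subnet-0bbb2222"

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier-multi-az",       # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateName": "web-tier-template",  # assumed to exist already
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    # Spanning several AZs lets the group replace instances in a healthy AZ
    # if one AZ becomes impaired.
    VPCZoneIdentifier=MULTI_AZ_SUBNETS,
)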

DR objectives are usually described with two specific KPIs: the recovery time objective (RTO) and the recovery point objective (RPO). The RTO defines the maximum downtime allowed for your workload following a disaster, that is, how long it can take before it is back online. The RPO defines the maximum amount of time between the latest data recovery point and the disaster; in other words, how much data your workload is allowed to lose. Both are time measures, typically expressed in minutes, hours, or even days for the least critical workloads. The solution's cost and complexity rise as the RTO and/or RPO values decrease.
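To make the two measures concrete, here is a small worked example in Python, assuming a hypothetical timeline in which the last usable backup was taken at 02:00, the disaster struck at 03:30, and the workload was back online at 05:00.

from datetime import datetime

last_backup = datetime(2024, 1, 15, 2, 0)    # latest recovery point
disaster = datetime(2024, 1, 15, 3, 30)      # moment of the disaster
back_online = datetime(2024, 1, 15, 5, 0)    # service restored

achieved_rpo = disaster - last_backup        # data lost: 1 h 30 min
achieved_rto = back_online - disaster        # downtime: 1 h 30 min

print(f"Achieved RPO: {achieved_rpo}, achieved RTO: {achieved_rto}")
# The design meets its objectives only if these achieved values stay at or
# below the RPO and RTO targets agreed upon for the workload.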

You will now explore the processes for preventing and planning for a disaster.