Warm Standby – Ensuring Business Continuity – SAP-C02 Study Guide

Warm Standby

The warm standby approach goes a step further than pilot light. It extends the same concept but also maintains a running, although scaled-down, copy of your workload. Your service is therefore already up and running, and the only thing you need to do is scale up the compute resources required by your workload. This approach is illustrated in the following diagram:

Figure 7.4: Warm standby approach

Understandably, this approach targets situations where your RTO is too short for both the backup and recovery and the pilot light approaches, yet long enough to leave you time to scale up the environment before it handles the full production load in the new region. It is typically a good fit when your RTO is in the range of minutes.

Compared to pilot light, it is even easier to test and validate that your DR plan is fully functional with this approach because you don’t need to take any other action than scaling up to be fully operational.

AWS Services for a Warm Standby Approach

In the warm standby approach, you use the AWS services already mentioned for the previous two approaches. On top of those, you need to ensure that your workload can rapidly scale up its compute resources to sustain the full production load in the new region. For this, you rely on AWS Auto Scaling to monitor the performance of your compute resources and adjust capacity as needed. Auto Scaling works with AWS services such as EC2, ECS, DynamoDB, and Aurora. EKS instead uses Kubernetes-specific autoscaling mechanisms: the Kubernetes Cluster Autoscaler or the more recent Karpenter to scale cluster resources (such as EC2 nodes), and the Vertical Pod Autoscaler and Horizontal Pod Autoscaler to scale Pods. You would need to leverage those to ensure that your EKS clusters and Pods are scaled up to the desired capacity.
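As a minimal sketch of the scale-up step, assume the standby compute tier runs in an EC2 Auto Scaling group. The function below raises the group's size limits and desired capacity through the `update_auto_scaling_group` API call. The function name, the group name, the region, and the capacity figures are hypothetical, and the client is passed in so that the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict


def scale_up_standby(autoscaling: Any, group_name: str,
                     min_size: int, desired: int, max_size: int) -> Dict[str, int]:
    """Raise a scaled-down Auto Scaling group to production capacity.

    `autoscaling` is expected to behave like boto3.client("autoscaling").
    """
    if not (0 <= min_size <= desired <= max_size):
        raise ValueError("expected 0 <= min_size <= desired <= max_size")

    capacity = {
        "MinSize": min_size,
        "DesiredCapacity": desired,
        "MaxSize": max_size,
    }
    # One API call updates both the size limits and the desired capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name, **capacity
    )
    return capacity


# Hypothetical usage during a failover (names and numbers are assumptions):
#   import boto3
#   client = boto3.client("autoscaling", region_name="eu-west-1")
#   scale_up_standby(client, "web-standby-asg", min_size=4, desired=6, max_size=12)
```

In practice, scaling policies would normally keep adjusting capacity after this initial step; the explicit call only shortens the time needed to reach production size.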

Active-Active

The multi-region active-active approach is the ultimate DR approach for the most business-critical workloads, for which none of the previous three approaches could satisfy your RTO and RPO. With this approach, your workload is running concurrently in (at least) two separate regions. This is illustrated in the following diagram:

Figure 7.5: Active-active approach

This approach targets scenarios where you need an RTO of zero (no downtime) and an RPO as close as possible to zero. This, however, comes at a cost, since you run a fully functional and scaled-up environment to support your workload in multiple regions (at least two).

Compared to warm standby, it is even easier to test and validate that your DR plan is fully functional, because you are already fully operational in multiple regions and don’t need to take any action at all.

AWS Services for an Active-Active Approach

In the multi-region active-active approach, the AWS services mentioned in the previous three approaches remain useful. They are simply used slightly differently.

For instance, Route 53 or Global Accelerator would be configured to distribute traffic across both active regions; only in the case of a failover would they redirect all traffic to the remaining healthy region.
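One way to picture this with Route 53 is weighted routing with a health check attached to each record. The sketch below only builds the change batch that would be passed to the `change_resource_record_sets` API call. The record name, endpoints, health check IDs, and the helper function names are all hypothetical.

```python
from typing import Any, Dict, List, Tuple


def weighted_record(name: str, region: str, target: str, weight: int,
                    health_check_id: str, ttl: int = 60) -> Dict[str, Any]:
    """Build one weighted CNAME change for a regional endpoint."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": region,  # distinguishes records sharing a name
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def active_active_change_batch(
    name: str, endpoints: List[Tuple[str, str, str]], weight: int = 100
) -> Dict[str, Any]:
    """Give every (region, target, health_check_id) endpoint the same weight."""
    return {
        "Comment": "active-active weighted routing (illustrative)",
        "Changes": [
            weighted_record(name, region, target, weight, check_id)
            for region, target, check_id in endpoints
        ],
    }


# Hypothetical usage (all identifiers below are assumptions):
#   import boto3
#   batch = active_active_change_batch("app.example.com", [
#       ("us-east-1", "use1.example.com", "hc-use1-id"),
#       ("eu-west-1", "euw1.example.com", "hc-euw1-id"),
#   ])
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLE123", ChangeBatch=batch)
```

With equal weights, traffic is split evenly between the regions while both are healthy. When a record's health check fails, Route 53 stops returning that record, so requests go to the remaining healthy region.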

Regarding data, all the solutions discussed also remain valid options. Your choice will be based on what you need to achieve in terms of RTO and RPO. Reads are not really an issue, since you can always manage either to redirect the reads to a read replica (such as with RDS or Aurora) or to have the concurrency increased automatically (such as with DynamoDB or S3). On the other hand, writes are often a thorn in your side, but you have a number of options to deal with them as given below:

  • You may opt for a “write global” approach, as supported, for instance, by Aurora Global Database, where all the writes converge on a single database in a single region. In case of a failover, there is a brief period of downtime, but your RTO is limited to a few minutes at most (in general, even less than 1 minute) and your RPO can be very close to zero.
  • You may prefer to take a “write local” approach, such as what you can do with DynamoDB Global Tables. In case of a region failure, there is essentially no downtime since your datastore works in an active-active mode (RTO is zero or very close to zero) and your RPO can also be zero or very close to zero (the missing data from the impaired region will start to be synchronized again when the service comes back up).
  • The last option is to take a “write partitioned” approach, where you write in a given region based on a partition key. This lets you avoid conflicts when writing data. You could use S3 in this case and configure bi-directional cross-region replication to keep the buckets in the two regions in sync.
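The “write partitioned” option can be sketched as a small routing function: each partition key is mapped deterministically to a home region, and writes fall back to another region only when the home region is unhealthy. This is an illustrative sketch and not a prescribed AWS mechanism. The hashing scheme, the function names, and the way health is represented (a set of region names) are all assumptions.

```python
import hashlib
from typing import Sequence, Set


def home_region(partition_key: str, regions: Sequence[str]) -> str:
    """Map a partition key to a region using a stable hash.

    hashlib is used rather than the built-in hash(), which is randomized
    per process for strings and would not give a stable mapping.
    """
    if not regions:
        raise ValueError("at least one region is required")
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def write_region(partition_key: str, regions: Sequence[str],
                 healthy: Set[str]) -> str:
    """Pick the region that should receive the write for this key."""
    home = home_region(partition_key, regions)
    if home in healthy:
        return home
    # Home region is impaired: fall back to the first healthy region.
    for region in regions:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")
```

Because a given key is always written in the same region while that region is healthy, two regions never accept concurrent writes for the same key, which is what avoids write conflicts. After a fallback, the data still needs to be replicated back once the impaired region recovers.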

Now that you have explored the approaches you can take, you are ready to learn how you can make sure that your DR strategy functions.