AWS Services for a Pilot Light Approach
In addition to the backup services covered in the backup and recovery approach, you now have to consider services that offer continuous replication, particularly if you need to satisfy a lower RPO.
S3 natively supports automatic cross-region replication. Replication requires versioning to be enabled on both the source and destination buckets, so you can not only recover in a separate Region but also, if needed, carry out a point-in-time recovery of specific versions of your objects.
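As a rough illustration of the S3 setup described above, the following sketch builds a replication configuration and applies it. It assumes `s3_source` and `s3_destination` are boto3 S3 clients created for the two Regions (for example with `boto3.client("s3", region_name=...)`), that both buckets already exist, and that an IAM role allowing S3 to replicate objects has been created. The rule ID, bucket names, and role ARN are hypothetical.

```python
from typing import Any, Dict


def build_replication_config(role_arn: str, destination_bucket: str) -> Dict[str, Any]:
    """Return a ReplicationConfiguration that replicates every object in a bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "pilot-light-crr",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_cross_region_replication(
    s3_source: Any,
    s3_destination: Any,
    source_bucket: str,
    destination_bucket: str,
    role_arn: str,
) -> None:
    """Turn on versioning for both buckets, then attach the replication rule."""
    versioning = {"Status": "Enabled"}
    # Versioning must be enabled on both sides before replication is accepted.
    s3_source.put_bucket_versioning(
        Bucket=source_bucket, VersioningConfiguration=versioning
    )
    s3_destination.put_bucket_versioning(
        Bucket=destination_bucket, VersioningConfiguration=versioning
    )
    # The replication rule lives on the source bucket.
    s3_source.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=build_replication_config(role_arn, destination_bucket),
    )
```

Keeping the configuration builder separate from the API calls makes it easy to review or unit test the rule without touching an AWS account.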
RDS and Aurora also support continuous cross-region replication with Read Replicas and Global Database, respectively. DynamoDB also offers native multi-region support with its global tables.
Promoting a read replica in Region B after the primary database in Region A fails typically takes a few minutes with RDS. Promoting a secondary cluster to the role of primary cluster with Aurora Global Database, on the other hand, can take place in under a minute. That promotion time is downtime, so it sets a floor on the RTO you can achieve. In both cases, the RPO is typically measured in seconds and is determined by the replication lag between the primary and the replicated data stores.
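To make the relationship between promotion time, RTO, and RPO concrete, here is a small back-of-the-envelope sketch. All the durations are illustrative assumptions rather than measured or guaranteed figures: the only inputs taken from the text are that RDS promotion takes a few minutes, Aurora Global Database promotion can take under a minute, and replication lag is a matter of seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverEstimate:
    """Rough failover timings, all in seconds (illustrative values only)."""

    detection_s: float        # time to notice the disaster and decide to fail over
    promotion_s: float        # time to promote the replica or secondary cluster
    compute_start_s: float    # time to start compute in the second Region
    routing_s: float          # time for traffic to reach the new endpoints
    replication_lag_s: float  # lag between primary and replica at failure time

    @property
    def rto_s(self) -> float:
        # Downtime is at least the sum of the sequential recovery steps.
        return (
            self.detection_s
            + self.promotion_s
            + self.compute_start_s
            + self.routing_s
        )

    @property
    def rpo_s(self) -> float:
        # Data written during the replication lag may be lost.
        return self.replication_lag_s


# Assumed numbers: only the promotion time differs between the two cases.
rds = FailoverEstimate(60, 300, 240, 60, 5)
aurora = FailoverEstimate(60, 60, 240, 60, 5)

print(f"RDS read replica: RTO ~{rds.rto_s:.0f}s, RPO ~{rds.rpo_s:.0f}s")
print(f"Aurora Global DB: RTO ~{aurora.rto_s:.0f}s, RPO ~{aurora.rpo_s:.0f}s")
```

The sketch treats the recovery steps as strictly sequential, which is a simplification. In practice some of them, such as starting compute while the database is being promoted, may overlap.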
DynamoDB global tables work differently from the RDS and Aurora multi-region mechanisms. A global table is a single entity made up of active-active replicas, and it maintains replication across all of them. Every regional replica acts as both a primary and a secondary, accepting reads and writes. Therefore, a regional disaster will not have any impact at the database level, except that data replication to and from the impacted region will stop for the duration of the disaster.
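Because every replica of a global table accepts writes, an application can simply direct its requests to another Region when its preferred one is unavailable. The following is a minimal sketch of that idea, assuming each entry in `replicas` pairs a Region name with a boto3 DynamoDB client for that Region. The helper name and the broad exception handling are simplifications for illustration, and production code would catch specific botocore exceptions instead.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


def put_item_with_regional_fallback(
    replicas: Sequence[Tuple[str, Any]],
    table_name: str,
    item: Dict[str, Any],
) -> str:
    """Write to the first replica that accepts the item; return its Region."""
    last_error: Optional[Exception] = None
    for region, client in replicas:
        try:
            # Any replica of a global table can take the write.
            client.put_item(TableName=table_name, Item=item)
            return region
        except Exception as exc:  # sketch only: narrow this in real code
            last_error = exc
    raise RuntimeError("no regional replica accepted the write") from last_error
```

A caller would list its replicas in order of preference, for example `[("eu-west-1", client_a), ("eu-central-1", client_b)]`, and pass items in the low-level attribute format such as `{"pk": {"S": "order#1"}}`.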
In the pilot light scenario, you don’t have any compute running in the second region when a failover starts. Once your workload comes up there, you need to update any external reference to it, such as DNS domain names. If you use Amazon Route 53, you can leverage its health checks, which will eventually route traffic to the healthy endpoints (in the second region). This means that you need to add the new endpoints to your Route 53 configuration once your compute resources become available. There is, however, some time lag before traffic actually shifts, even if you lower the DNS record’s time to live (TTL), because resolvers and clients cache the previous answer until it expires.
If you rely on AWS Global Accelerator instead, routing failover can happen faster. Global Accelerator associates a set of static IPs with multiple endpoints and routes traffic from the AWS edge network, through the AWS backbone, to the closest healthy endpoint. Because the IP addresses that clients connect to never change, failover does not depend on cached DNS records expiring.
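The Route 53 side of this can be sketched as a pair of failover records plus a rough estimate of the DNS lag. The code below only builds the `ChangeBatch` payload, which would be passed to the Route 53 `change_resource_record_sets` call together with your hosted zone ID. The domain names and health check IDs are hypothetical, and the lag formula is a simplifying assumption (time for the health check to fail plus one TTL) that ignores resolvers that do not honor the TTL.

```python
from typing import Any, Dict


def failover_record(
    name: str, role: str, target: str, health_check_id: str, ttl_s: int = 60
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover-routed CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,
            "TTL": ttl_s,  # a low TTL shortens how long stale answers are cached
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def failover_change_batch(
    name: str,
    primary_target: str,
    secondary_target: str,
    primary_health_check_id: str,
    secondary_health_check_id: str,
    ttl_s: int = 60,
) -> Dict[str, Any]:
    """Build the ChangeBatch holding the primary and secondary records."""
    return {
        "Comment": "Pilot light failover records",
        "Changes": [
            failover_record(name, "PRIMARY", primary_target, primary_health_check_id, ttl_s),
            failover_record(name, "SECONDARY", secondary_target, secondary_health_check_id, ttl_s),
        ],
    }


def worst_case_dns_lag_s(
    request_interval_s: int = 30, failure_threshold: int = 3, ttl_s: int = 60
) -> int:
    """Approximate upper bound: health check failure time plus one TTL."""
    return request_interval_s * failure_threshold + ttl_s
```

With the assumed defaults, `worst_case_dns_lag_s()` gives 150 seconds, which illustrates why lowering the TTL alone does not remove the lag: the health check still needs time to declare the primary endpoint unhealthy.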
One AWS service that can also play a key role in the pilot light strategy is CloudEndure Disaster Recovery. It performs block-level replication of virtual machines (VMs) from a source environment to a target environment. The source environment can be on AWS or on-premises. The target environment in this scenario is your AWS environment in the second Region. CloudEndure Disaster Recovery continuously replicates the VMs so that, in case of a disaster in the source region, you are ready to fail over and start the instances in the second region. So, on top of backup, continuous data replication, and IaC practices, this is one more practical tool in your SA toolkit that can be quite handy with a pilot light approach.
At the time of writing, AWS has launched AWS Elastic Disaster Recovery, the successor to CloudEndure Disaster Recovery and now the recommended service for pilot light DR. It does not yet have feature parity with CloudEndure’s solution, but it is only a matter of time before it does.