Backup and Restore – Backup and Restore Strategies – SOA-C02 Study Guide

Backup and Restore

The simplest option is backup and restore. All stateful AWS services support some form of backup. Backup and restore can be a great strategy when the RPO and RTO are long (typically hours) because the approach is very low cost and easy to implement. The cheapest backup and restore approach can be implemented within one region, but we recommend that you always replicate a copy of regional backups to another region, which protects you in case of a regional outage caused by a major disaster. A simple way to do this is to copy production data to S3 on a predefined schedule (once an hour, for example). If you need to restore, you can simply recover the object from S3. In a disaster recovery scenario, you can deploy a new instance in the backup environment and restore the data onto the newly deployed instance. The disaster recovery capability built into this scenario has negligible cost.
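To make the link between the backup schedule and the RPO concrete, here is a minimal Python sketch. It assumes the worst case is a failure just before the next backup finishes copying, so the potential data loss is the schedule interval plus the time the copy takes. The function names and parameters are illustrative and are not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta,
                         copy_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on data lost with scheduled backups (illustrative assumption).

    If a failure happens just before the next backup completes, the newest
    usable backup reflects the state from one full interval plus one copy
    duration ago.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if copy_duration < timedelta(0):
        raise ValueError("copy_duration must not be negative")
    return interval + copy_duration


def meets_rpo(interval: timedelta, copy_duration: timedelta,
              rpo_target: timedelta) -> bool:
    """Return True if the schedule can satisfy the RPO target in the worst case."""
    return worst_case_data_loss(interval, copy_duration) <= rpo_target


if __name__ == "__main__":
    hourly = timedelta(hours=1)
    copy_time = timedelta(minutes=10)
    print(worst_case_data_loss(hourly, copy_time))
    print(meets_rpo(hourly, copy_time, rpo_target=timedelta(hours=2)))
```

Under these assumptions, an hourly schedule with a ten-minute copy does not quite satisfy a one-hour RPO, which is why the schedule interval should be set shorter than the RPO target rather than equal to it.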

Pilot Light

Pilot light builds on the backup and restore approach by keeping some services ready to be started quickly and easily in case of a disaster. The RPO can be lowered from hours to minutes by replicating data more often, or even by replicating a database to a cross-region read replica. This approach keeps data loss minimal, potentially as low as a few minutes or even seconds, but the RPO/RTO goal of a pilot light should always be set to the “worst-case scenario” of perhaps tens of minutes. To shorten the RTO, you can prepare images (AMIs) or even set up stopped EC2 instances in the backup environment. In case of a disaster, the deployment can ideally be completed within minutes because the instances are either launched from the AMI or simply started. The cost of the pilot light strategy can be very low; in addition to the backups, you need to account for any running resources, such as a read replica of your production database.
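The advice to plan for the worst case rather than the typical case can be sketched in a few lines of Python. This is only an illustration: it assumes you have already collected replication lag samples (in seconds) from your own monitoring, and the function name and the safety factor of 2 are hypothetical choices, not AWS recommendations.

```python
from typing import Sequence


def planning_rpo_seconds(lag_samples: Sequence[float],
                         safety_factor: float = 2.0) -> float:
    """Derive an RPO planning figure from observed replication lag.

    Uses the largest observed lag, not the average, and multiplies it by a
    safety factor so that the stated RPO reflects the worst case.
    """
    if not lag_samples:
        raise ValueError("at least one lag sample is required")
    if safety_factor < 1.0:
        raise ValueError("safety_factor must be at least 1.0")
    return max(lag_samples) * safety_factor


if __name__ == "__main__":
    # Hypothetical samples: lag is usually seconds, with one spike of 4 minutes.
    samples = [2.0, 5.0, 240.0]
    print(planning_rpo_seconds(samples))
```

Even though most samples in this made-up series are a few seconds, the single spike drives the planning figure into the range of minutes, which matches the guidance to state the pilot light RPO conservatively.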

Warm Standby

A warm standby extends the pilot light strategy by keeping a small subset of AWS services operational at all times. This is particularly important when the RPO and RTO are very low and the whole application needs to be up and running within a few minutes at worst. In case of a disaster, the warm standby can be made primary and scaled out to support production traffic. There will be a brief outage while traffic is redirected and the standby scales out, because the application's DNS record will take tens of seconds to reflect the change and the scale-out process might take another minute or two on top of that. This solution is more costly because it requires an active, lower-capacity site to be running at all times. However, you can also trickle a subset of traffic through the warm standby, thus continuously verifying that the application can be failed over at any time.
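The outage during a warm standby failover can be estimated with simple arithmetic. The sketch below assumes, as the paragraph above describes, that DNS propagation and scale-out happen one after the other, so their durations add up. The function names and the sample values are illustrative assumptions, not measured figures.

```python
def estimated_failover_outage(dns_propagation_seconds: float,
                              scale_out_seconds: float) -> float:
    """Estimate the outage for a warm standby failover, in seconds.

    Assumes the two phases are sequential: clients first pick up the new
    DNS record, then the standby finishes scaling out to full capacity.
    """
    if dns_propagation_seconds < 0 or scale_out_seconds < 0:
        raise ValueError("durations must not be negative")
    return dns_propagation_seconds + scale_out_seconds


def within_rto(dns_propagation_seconds: float, scale_out_seconds: float,
               rto_seconds: float) -> bool:
    """Return True if the estimated outage fits inside the RTO target."""
    outage = estimated_failover_outage(dns_propagation_seconds, scale_out_seconds)
    return outage <= rto_seconds


if __name__ == "__main__":
    # Hypothetical values: 60 s for DNS to update, 120 s to scale out.
    print(estimated_failover_outage(60, 120))
    print(within_rto(60, 120, rto_seconds=300))
```

With these sample values the estimate is three minutes, which fits an RTO of a few minutes but leaves little headroom, so both the DNS TTL and the scale-out time are worth measuring in a real failover test.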