RPO and RTO – Backup and Restore Strategies – SOA-C02 Study Guide

RPO and RTO

Whenever you choose a backup strategy, you also need to define the recovery point objective (RPO) and the recovery time objective (RTO). The RPO defines how much data, measured as a span of time, can be lost during an event that requires you to restore data, and the RTO defines the time allowed to recover the data and bring the application fully online.

For example, an RPO of one hour means that you can lose no more than one hour’s worth of data. You should therefore select a backup procedure that captures data at least once every hour. Where possible, capture data more frequently than that, so that you have several recovery points within the RPO window rather than just one. This matters because the latest recovery point might be incomplete, corrupted, or otherwise not a valid point to return to.
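The arithmetic behind this recommendation can be sketched in a few lines of Python. This is a minimal illustration, not part of any AWS tooling: the function names `worst_case_data_loss` and `meets_rpo` are hypothetical, and the model assumes backups complete instantly and run at a fixed interval.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, unusable_points: int = 0) -> timedelta:
    """Upper bound on lost data if the newest `unusable_points` recovery points cannot be used.

    Simplifying assumption: backups complete instantly and run at a fixed interval.
    """
    if unusable_points < 0:
        raise ValueError("unusable_points must be zero or greater")
    # Each unusable recovery point pushes the restore back by one more interval.
    return backup_interval * (unusable_points + 1)


def meets_rpo(rpo: timedelta, backup_interval: timedelta, unusable_points: int = 0) -> bool:
    """True if the worst-case loss still fits inside the RPO."""
    return worst_case_data_loss(backup_interval, unusable_points) <= rpo


rpo = timedelta(hours=1)
hourly = timedelta(hours=1)
quarter_hourly = timedelta(minutes=15)

print(meets_rpo(rpo, hourly))                             # True
print(meets_rpo(rpo, hourly, unusable_points=1))          # False
print(meets_rpo(rpo, quarter_hourly, unusable_points=1))  # True
```

With hourly backups, a single corrupted recovery point already pushes the potential loss to two hours, which breaks a one-hour RPO. Under the same assumptions, a 15-minute schedule still meets the objective after falling back one recovery point.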

An RTO of one hour means that you need to bring the application back to a working state within one hour of the event. To achieve the lowest possible RTO, you need a recovery strategy that is simple to execute and as automated as possible.

Figure 6.1 represents the RPO and RTO and how these two factors relate to an event that disrupts an application.

FIGURE 6.1 RTO and RPO
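How the two objectives relate to a disruptive event can also be expressed as a small calculation over an incident timeline. The following is only a sketch: the `Incident` class, the `evaluate` function, and the timestamps are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Three points on the timeline (hypothetical model for illustration)."""
    last_recovery_point: datetime  # newest usable backup before the event
    event: datetime                # moment the disruption occurred
    service_restored: datetime     # moment the application was fully online again

    @property
    def data_loss_window(self) -> timedelta:
        # Data written between the last recovery point and the event is lost.
        return self.event - self.last_recovery_point

    @property
    def downtime(self) -> timedelta:
        # Time from the event until the application is back online.
        return self.service_restored - self.event


def evaluate(incident: Incident, rpo: timedelta, rto: timedelta) -> dict:
    """Compare the measured windows against the stated objectives."""
    return {
        "rpo_met": incident.data_loss_window <= rpo,
        "rto_met": incident.downtime <= rto,
    }


incident = Incident(
    last_recovery_point=datetime(2024, 1, 1, 9, 30),
    event=datetime(2024, 1, 1, 10, 15),
    service_restored=datetime(2024, 1, 1, 11, 30),
)
result = evaluate(incident, rpo=timedelta(hours=1), rto=timedelta(hours=1))
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this example the data loss window is 45 minutes, so the one-hour RPO is met, but the 75 minutes of downtime exceed the one-hour RTO. The RPO looks backward from the event and the RTO looks forward from it.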

Disaster Recovery

Another term that we need to cover when discussing backups is disaster recovery. The idea of disaster recovery stems from traditional datacenters, where computing was done in one location (usually due to cost and proximity to the clients/workforce). Traditionally, disaster recovery outlines a plan to recover the primary operating environment from remotely stored backups. These backups could be represented by any type of data storage, ranging from tapes or disks, cold-stored in a remote location, that can be returned to the primary environment and restored, to online storage and standby servers in a remote location that can be activated to take the production load at any moment, and everything in between.

The traditional concept implies that there are so-called cold resources (tapes, disks, storage, and servers) that are not used in day-to-day operations because they must be kept available at all times in case of a disaster. This can leave expensive hardware unused and can dramatically increase the operating expenses of any application.

In AWS, there is no concept of cold datacenters. All of the equipment is online at all times to deliver AWS services, and the load on that equipment is kept below full capacity. This means that there is generally spare capacity across each AWS datacenter, Availability Zone, and Region that you can use for disaster recovery. Because AWS services are pay-per-use, you do not have to pay for idle resources that are “waiting” for you to fail over to; you pay only for what you actually provision.

Overview of Backup Strategies

One thing to consider in AWS is that any backup can also be used as a seed for disaster recovery. To ensure you can both recover data and restore a complete environment in case of disaster, you have to select a backup strategy. There are four general ways to set up the backup of your environment that will also support full disaster recovery:

Backup and restore

Pilot light

Warm standby

Multisite active-active

Regardless of whether your primary environment is on-premises or in AWS, the design of these strategies is the same. In the following sections, we refer to the production environment, which represents the primary site or primary AWS Region, and the backup environment, which represents an AWS Region where backups are stored and where disaster recovery can be initiated.