In this section, you will see how RDS automatic backups and manual snapshots work. These features come with Amazon RDS.
Let’s consider a database that is scheduled to take a backup at 5 A.M. every day. If the application fails at 11 A.M., then it is possible to restart the application from the backup taken at 5 A.M., with the loss of 6 hours’ worth of data. This is called a 6-hour Recovery Point Objective (RPO). The RPO is defined as the time between the most recent backup and the incident, and it determines the maximum amount of data that can be lost. If you want to reduce it, then you have to schedule more frequent incremental backups, which increases the cost. If your business demands a lower RPO value, then the business must spend more to provide the necessary technical solutions.
Continuing the example, an engineer was assigned the task of bringing the system back online as soon as the disaster occurred at 11 A.M. The engineer managed to bring the database online at 2 P.M. on the same day by adding a few extra hardware components to the current system and installing some updated versions of the software. This is called a 3-hour Recovery Time Objective (RTO). The RTO is defined as the time between the disaster and full recovery. RTO values can be reduced by having spare hardware and documenting the restoration process. If the business demands a lower RTO value, then it must spend more money on spare hardware and an effective system setup to perform the restoration process.
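The arithmetic behind the two objectives can be checked with a short Python sketch. The times come from the example above; the calendar date and the function names are arbitrary choices made for illustration.

```python
from datetime import datetime, timedelta

# Timeline from the example; the calendar date itself is arbitrary.
last_backup = datetime(2024, 1, 1, 5, 0)    # daily backup at 5 A.M.
incident = datetime(2024, 1, 1, 11, 0)      # application fails at 11 A.M.
back_online = datetime(2024, 1, 1, 14, 0)   # database is back online at 2 P.M.


def recovery_point(last_backup: datetime, incident: datetime) -> timedelta:
    """Data-loss window: time between the most recent backup and the incident."""
    return incident - last_backup


def recovery_time(incident: datetime, back_online: datetime) -> timedelta:
    """Downtime: time between the incident and full recovery."""
    return back_online - incident


rpo = recovery_point(last_backup, incident)
rto = recovery_time(incident, back_online)
print(f"RPO: {rpo}, RTO: {rto}")  # prints "RPO: 6:00:00, RTO: 3:00:00"
```

Moving the backup closer to the incident shrinks the first value, and a faster restoration shrinks the second, which is why the two objectives are budgeted for separately.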
In RDS, the RPO and RTO play an important role in the choice between automatic backups and manual snapshots. Both of these backup services use AWS-managed S3 buckets, which means the backups are not visible in the user’s AWS S3 console. They are Region-resilient because the backup is replicated into multiple Availability Zones in the AWS Region. In the case of a Single-AZ RDS instance, the backup is taken from the single available data store, and for a Multi-AZ enabled RDS instance, the backup is taken from the standby data store (the primary store is not touched by the backup).
Snapshots of RDS instances are taken manually, and they are stored in the AWS-managed S3 bucket. The first snapshot of an RDS instance is a full copy of the data, and subsequent snapshots are incremental, reflecting only the changes in the data. The first snapshot therefore takes the longest, and the incremental snapshots that follow are quicker. Taking a snapshot can impact the performance of a Single-AZ RDS instance, but not the performance of a Multi-AZ RDS instance, as the snapshot is taken from the standby data store. Manual snapshots do not expire, have to be deleted manually, and live past the termination of an RDS instance. When you delete an RDS instance, RDS offers to make one final snapshot on your behalf, and it will contain all the databases inside your RDS instance (an RDS instance can hold more than one database). When you restore from a manual snapshot, you restore to a single point in time, and that affects the RPO.
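As a minimal sketch of taking a manual snapshot programmatically, the following Python builds the arguments for the RDS `CreateDBSnapshot` API, which boto3 exposes as `create_db_snapshot`. The helper names, the instance identifier, and the snapshot naming scheme are hypothetical; `rds_client` is assumed to be a client created with `boto3.client("rds")`.

```python
from datetime import datetime, timezone


def snapshot_request(instance_id: str, now: datetime) -> dict:
    """Build keyword arguments for the RDS CreateDBSnapshot call.

    The "<instance>-manual-<timestamp>" naming scheme is only an illustration.
    """
    stamp = now.strftime("%Y-%m-%d-%H-%M")
    return {
        "DBInstanceIdentifier": instance_id,
        "DBSnapshotIdentifier": f"{instance_id}-manual-{stamp}",
    }


def take_manual_snapshot(rds_client, instance_id: str) -> dict:
    """Request a manual snapshot; rds_client is assumed to be boto3.client("rds")."""
    params = snapshot_request(instance_id, datetime.now(timezone.utc))
    return rds_client.create_db_snapshot(**params)
```

Because manual snapshots never expire, a script like this is usually paired with a cleanup routine that deletes old snapshots by name or age.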
To automate this entire process, you can choose a time window in which these snapshots are taken. This is called an automatic backup. These time windows can be chosen carefully to lower the RPO value of the business. Automatic backups have a retention period of 0 to 35 days, where 0 disables automatic backups and 35 days is the maximum. To quote AWS documentation, retained automated backups contain system snapshots and transaction logs from a database instance. They also include database instance properties, such as allocated storage and the database instance class, which are required to restore it to an active instance. Databases generate transaction logs, which record the actual changes made to the data in a particular database. RDS writes these transaction logs to S3 every 5 minutes. The transaction logs can be replayed on top of a snapshot to restore to a point in time with a granularity of 5 minutes. Theoretically, the RPO can therefore be as low as 5 minutes.
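The retention rule and a point-in-time restore can be sketched in Python as follows. The dictionaries use the parameter names of the RDS `ModifyDBInstance` and `RestoreDBInstanceToPointInTime` APIs (boto3 methods `modify_db_instance` and `restore_db_instance_to_point_in_time`); the helper names and identifiers are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

MAX_RETENTION_DAYS = 35  # a retention period of 0 disables automatic backups


def backup_settings(instance_id: str, retention_days: int, window: str) -> dict:
    """Arguments for ModifyDBInstance that configure automatic backups.

    window is a daily UTC range in "hh24:mi-hh24:mi" form, e.g. "05:00-05:30".
    """
    if not 0 <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("retention period must be between 0 and 35 days")
    return {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": retention_days,
        "PreferredBackupWindow": window,
    }


def point_in_time_restore_request(
    source_id: str, target_id: str, restore_time: Optional[datetime] = None
) -> dict:
    """Arguments for RestoreDBInstanceToPointInTime.

    With no restore_time, the latest restorable time is requested.
    """
    params = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
    }
    if restore_time is None:
        params["UseLatestRestorableTime"] = True
    else:
        params["RestoreTime"] = restore_time
    return params
```

Assuming `rds = boto3.client("rds")`, the calls would look like `rds.modify_db_instance(**backup_settings("orders-db", 7, "05:00-05:30"))` and `rds.restore_db_instance_to_point_in_time(**point_in_time_restore_request("orders-db", "orders-db-restored"))`. The restore always targets a new instance identifier.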
When you perform a restore, RDS creates a new RDS instance, which means a new database endpoint to access the instance. The applications using the instance have to be pointed to the new address. Together with the time it takes to create the instance and restore the data, this means that the restoration process is not very fast, which affects the RTO. To minimize the RTO during a failure, you may consider replicating the data. However, if the data becomes corrupted, the corruption is replicated to the replicas as well. The only way to overcome this is to have snapshots and restore an RDS instance to a particular point in time prior to the corruption.
Amazon RDS Read Replicas are unlike the Multi-AZ standby replicas. In Multi-AZ RDS instances, the standby replicas cannot be used directly for anything unless the primary instance fails, whereas Read Replicas can be used directly, but only for read operations. Read Replicas have their own database endpoints, and read-heavy applications can point directly to these addresses. They are kept in sync asynchronously with the primary instance. Read Replicas can be created in the same Region as the primary instance or in a different Region. Read Replicas in other Regions are called Cross-Region Read Replicas, and they improve the global performance of the application.
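As per AWS documentation, five direct Read Replicas are allowed per database instance (the exact limit depends on the database engine), and this helps to scale out read performance. Because changes are replicated continuously, Read Replicas offer a very low RPO value, although asynchronous replication means that the most recent writes can be lost if they have not yet reached the replica. Read Replicas can be promoted to a read-write database instance in the case of a primary instance failure. This can be done quickly, and it offers a fairly low RTO value.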
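Creating a Read Replica and promoting it after a primary failure can be sketched in Python as follows. The two wrappers call the boto3 RDS methods `create_db_instance_read_replica` and `promote_read_replica`; the wrapper names and identifiers are hypothetical, and `rds_client` is assumed to be a client created with `boto3.client("rds")`.

```python
def create_read_replica(rds_client, source_id: str, replica_id: str) -> dict:
    """Create a Read Replica of source_id in the client's Region.

    For a Cross-Region replica, the source is identified by its ARN and the
    client is created in the destination Region.
    """
    return rds_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_id,
    )


def promote_replica(rds_client, replica_id: str) -> dict:
    """Promote a Read Replica to a standalone read-write instance."""
    return rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
```

Promotion is one-way: the promoted instance stops replicating from its source, and applications still have to be pointed at the replica’s endpoint, so the RTO includes that switchover.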
In the next section, you will learn about Amazon’s database engine, Amazon Aurora.