The first step, before you can take any countermeasures, is to detect that a disaster is actually taking place. Your recovery objectives (RTO and RPO) dictate how much time you have to do so. Consider a situation where you have an RTO of 4 hours and an RPO of 1 hour. This implies that you have up to 4 hours to recover in case of a disaster, and that you cannot lose more than an hour’s worth of data. It also means that, whenever a disaster occurs, you must be able to detect the event rapidly enough to notify the stakeholders, escalate if needed, and trigger the DR response within 1 hour: data written or corrupted while the disaster goes unnoticed may count against your RPO, and every minute spent on detection is also deducted from the 4 hours of your RTO.
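To make this time budget concrete, the following is a minimal sketch of the arithmetic. The function name and the 2.5-hour recovery estimate are assumptions made for illustration; only the RTO and RPO figures come from the example above.

```python
from datetime import timedelta


def latest_trigger_time(rto: timedelta, rpo: timedelta,
                        estimated_recovery: timedelta) -> timedelta:
    """Return how long after the start of a disaster the DR response must be triggered.

    Two limits apply, and the tighter one wins:
      * the RPO, following the reasoning in the text above;
      * the RTO minus the time the recovery itself is expected to take.
    """
    remaining_for_rto = rto - estimated_recovery
    if remaining_for_rto <= timedelta(0):
        raise ValueError("the estimated recovery alone already exceeds the RTO")
    return min(rpo, remaining_for_rto)


# Figures from the text: RTO of 4 hours, RPO of 1 hour.
# The 2.5-hour recovery estimate is an assumed value for illustration.
budget = latest_trigger_time(
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    estimated_recovery=timedelta(hours=2, minutes=30),
)
print(f"Detect, notify, escalate, and trigger within: {budget}")
```

With these figures, the RPO is the tighter of the two limits (1 hour versus 1.5 hours). If the recovery were expected to take 3.5 hours instead, the RTO would become the binding constraint and leave only 30 minutes.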
There are a number of things you can do to make sure you detect disasters in time.
Firstly, AWS offers a general service health dashboard that you can check to get the latest status information about AWS services in near real time. You can also subscribe to any of the associated RSS feeds to be notified when a specific AWS service goes down. Secondly, AWS provides the AWS Personal Health Dashboard (PHD), which lists the service events that affect your workloads. It presents both ongoing events and the history of events that occurred in the past 90 days.
In some cases, that may not be enough. If you have a very stringent RPO and/or RTO, you may need to rely on proactive detection methods, such as health checks, to detect disasters in a timely manner. Going into the details of how to design health checks that assist you effectively is beyond the scope of this book; however, you are encouraged to read the Implementing health checks whitepaper referenced in the Further Reading section at the end of this chapter. You will have to make sure that the health checks you put in place actually help you detect breaches of your business KPIs and identify disaster conditions early enough to meet your RPO and RTO.
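As a rough illustration of a KPI-based health check, the sketch below declares a disaster condition only after several consecutive probes fall below a threshold, which limits false alarms caused by a single bad reading. It is not taken from the whitepaper: the class name, the probe function, the KPI (successful orders per minute), and the thresholds are all hypothetical. In practice, the probe would query your own metrics source.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable


@dataclass
class KpiHealthCheck:
    """Hypothetical health check driven by a business KPI."""
    probe: Callable[[], float]      # returns the current KPI value
    min_healthy_value: float        # below this value, a probe counts as failed
    failure_threshold: int = 3      # consecutive failures before declaring a disaster
    _consecutive_failures: int = field(default=0, init=False, repr=False)

    def evaluate(self) -> bool:
        """Run one probe; return True when a disaster condition should be declared."""
        try:
            healthy = self.probe() >= self.min_healthy_value
        except Exception:
            # A probe that cannot even be executed is treated as a failure.
            healthy = False
        self._consecutive_failures = 0 if healthy else self._consecutive_failures + 1
        return self._consecutive_failures >= self.failure_threshold


def worst_case_detection(probe_interval: timedelta, failure_threshold: int) -> timedelta:
    """Approximate upper bound on the delay between a failure and its detection."""
    return probe_interval * failure_threshold


# Simulated KPI readings (assumed values): one healthy probe, then three failing ones.
readings = iter([120.0, 5.0, 3.0, 0.0])
check = KpiHealthCheck(probe=lambda: next(readings), min_healthy_value=50.0)
results = [check.evaluate() for _ in range(4)]
print(results)
print(worst_case_detection(timedelta(seconds=30), check.failure_threshold))
```

The probe interval and the failure threshold together determine how quickly a disaster is detected, so they should be chosen with the RTO and RPO in mind: a shorter interval or a lower threshold detects problems sooner, at the cost of more false positives.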
Once your disaster detection is in place, as always in IT, testing is paramount to validate that you can meet your DR objectives.
Testing is even more crucial in the case of business continuity, as the sustainability of your business relies on its ability to survive a disaster. Here, what you want to validate is that you can meet your RTO and RPO with the approach you selected.
The recommended approach to this validation is to test your DR strategy on a regular basis. You may even decide to test your strategy at a relatively high frequency, for instance on a weekly or bi-weekly basis, starting from the principle that the things that you repeat often become things that you do well.
In many cases, testing your strategy can be straightforward. In some cases, you may not even have to do anything (think of an active-active scenario). In all active-passive scenarios, testing can also be straightforward, provided that you have automated most of the steps that need to happen in case of a disaster in the primary region.
Even in the case of a backup and recovery approach, if you adopted AWS best practices to leverage IaC and automated release build and deployment, spinning up a test environment should be straightforward. You can then restore your backups there to assess RTO and RPO capabilities and run some checks on data content and integrity.
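The outcome of such a test can be assessed with a few lines of code. The sketch below is one possible way to do it, not a prescribed method: it derives the achieved RTO and RPO from three timestamps recorded during the exercise and compares a restored object with its source using SHA-256 digests. All names and sample values are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrTestResult:
    achieved_rto: timedelta
    achieved_rpo: timedelta
    meets_rto: bool
    meets_rpo: bool


def assess_dr_test(disaster_at: datetime, last_restorable_point: datetime,
                   service_restored_at: datetime,
                   rto: timedelta, rpo: timedelta) -> DrTestResult:
    """Compare the figures measured during a DR test with the objectives."""
    achieved_rto = service_restored_at - disaster_at      # time needed to recover
    achieved_rpo = disaster_at - last_restorable_point    # window of data that was lost
    return DrTestResult(achieved_rto, achieved_rpo,
                        achieved_rto <= rto, achieved_rpo <= rpo)


def same_content(source: bytes, restored: bytes) -> bool:
    """Basic integrity check: the restored data must hash to the same digest."""
    return hashlib.sha256(source).digest() == hashlib.sha256(restored).digest()


# Assumed timestamps from a simulated disaster, for illustration only.
result = assess_dr_test(
    disaster_at=datetime(2024, 1, 15, 10, 0),
    last_restorable_point=datetime(2024, 1, 15, 9, 20),
    service_restored_at=datetime(2024, 1, 15, 13, 10),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result)
print(same_content(b"order-42;paid", b"order-42;paid"))
```

Recording these figures for every test run makes it possible to see whether your recovery capability drifts over time as the workload and the data volume grow.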