Having automated recovery procedures is good, but making sure they work is better. If you’re new to the cloud, you may be used to testing your on-premises workloads under “normal” conditions, but less used to testing the recovery procedures that handle failures.
In the cloud, validating your recovery procedures is as easy as starting a test environment, deploying your workload on it, and carrying out enough tests to simulate various failure types. This will allow you to test, make fixes, or apply changes to your workloads, and to feel more confident that they can handle a real failure when it occurs, thus reducing your risks.
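As a rough illustration of the idea, the sketch below simulates a failure-injection test entirely in memory. It does not call any cloud API: the `Replica` and `Workload` classes, the `recover` method, and the `inject_failure` helper are all hypothetical names chosen for this example, and a real test would target an actual test environment instead.

```python
import random


class Replica:
    """A hypothetical stand-in for one running copy of a component."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


class Workload:
    """A toy workload that tries to keep a desired number of healthy replicas."""

    def __init__(self, desired: int) -> None:
        self.desired = desired
        self.replicas = [Replica(f"replica-{i}") for i in range(desired)]
        self._next_id = desired

    def healthy_count(self) -> int:
        return sum(1 for replica in self.replicas if replica.healthy)

    def recover(self) -> None:
        """Automated recovery: drop unhealthy replicas and start replacements."""
        self.replicas = [r for r in self.replicas if r.healthy]
        while len(self.replicas) < self.desired:
            self.replicas.append(Replica(f"replica-{self._next_id}"))
            self._next_id += 1


def inject_failure(workload: Workload, rng: random.Random) -> str:
    """Simulate a failure by marking one randomly chosen replica unhealthy."""
    victim = rng.choice(workload.replicas)
    victim.healthy = False
    return victim.name


def recovery_test(seed: int = 0) -> None:
    """Inject a failure, run the recovery procedure, and check the outcome."""
    rng = random.Random(seed)  # seeded so the test is repeatable
    workload = Workload(desired=3)

    inject_failure(workload, rng)
    assert workload.healthy_count() == 2, "failure injection had no effect"

    workload.recover()
    assert workload.healthy_count() == 3, "recovery did not restore capacity"


recovery_test()
```

The point of the sketch is the shape of the test: break something on purpose, run the recovery procedure, then assert that the workload is back to its expected state.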
Large monolithic systems that only scale vertically, that is, by adding more resources such as CPU and RAM, can suffer a complete system failure when unexpected events occur. It is thus highly recommended to design systems that can scale horizontally, that is, by replicating some or all of their components. By replacing a large resource with multiple smaller ones, you inherently reduce the impact of a single failure on the overall system. Doing so at every layer of your system (infrastructure elements, communication system, frontend, and backend components) will allow you to get rid of single points of failure and, as a result, improve the overall reliability of your workload.
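A small calculation makes the benefit concrete. The sketch below assumes, for simplicity, that all nodes are the same size and that load is spread evenly across them; the function name `capacity_after_failures` is made up for this example.

```python
def capacity_after_failures(node_count: int, failed_nodes: int = 1) -> float:
    """Return the fraction of total capacity left after some nodes fail.

    Assumes identical nodes with load spread evenly across them.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0 <= failed_nodes <= node_count:
        raise ValueError("failed_nodes must be between 0 and node_count")
    return (node_count - failed_nodes) / node_count


for nodes in (1, 2, 4, 10):
    remaining = capacity_after_failures(nodes)
    print(f"{nodes} node(s): {remaining:.0%} capacity left after one failure")
# 1 node(s): 0% capacity left after one failure
# 2 node(s): 50% capacity left after one failure
# 4 node(s): 75% capacity left after one failure
# 10 node(s): 90% capacity left after one failure
```

Under these assumptions, a single large node is a single point of failure, whereas losing one of ten smaller nodes removes only a tenth of the capacity.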
Resource exhaustion is one of the most natural causes of failure for any workload. It occurs when the demand placed on a workload outruns the system’s capacity. Excessive demand could result from a genuine peak in usage or from malicious activity, such as denial-of-service attacks. An on-premises environment requires you to guess your capacity needs so that you can provision the necessary infrastructure upfront. The cloud gives you the elasticity to scale as much as you need to meet the demand, provided that your design supports it.
So, no more capacity guessing. Instead, you first need to ensure that your application can scale horizontally; that’s most important. Second, you need to watch service quotas to make sure they don’t keep your workload from scaling out. Some service quotas are set on your accounts to protect you from over-provisioning resources. However, in some cases, you may need more resources than your current quotas allow. Most of these quotas can be adjusted, provided that you submit a request to increase them early enough. So, it is important to monitor them regularly and anticipate any limit you are about to reach.
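One way to picture such monitoring is a periodic check that flags any quota whose usage is approaching its limit. The sketch below is only illustrative: the `QuotaUsage` type, the `quotas_near_limit` helper, the 80% threshold, and the sample figures are assumptions made for this example, and in practice the usage and limit values would come from your cloud provider's quota and monitoring services.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class QuotaUsage:
    """Current usage of one service quota (hypothetical data shape)."""

    name: str
    used: float
    limit: float


def quotas_near_limit(
    usages: Iterable[QuotaUsage], threshold: float = 0.8
) -> List[QuotaUsage]:
    """Return quotas whose utilization is at or above the given threshold."""
    return [
        usage
        for usage in usages
        if usage.limit > 0 and usage.used / usage.limit >= threshold
    ]


# Made-up sample values, for illustration only.
current_usage = [
    QuotaUsage("running instances", used=70, limit=100),
    QuotaUsage("elastic IP addresses", used=5, limit=5),
    QuotaUsage("load balancers", used=41, limit=50),
]

for quota in quotas_near_limit(current_usage):
    utilization = quota.used / quota.limit
    print(f"Consider requesting an increase for {quota.name} ({utilization:.0%} used)")
```

Running a check like this on a schedule leaves time to submit an increase request before the quota blocks a scale-out.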
All changes made to your infrastructure should be automated. This is a matter of being able to make consistent deployments that you can track and repeat at will. Deploying changes manually is error-prone and should be avoided when managing distributed systems at scale. In the cloud, you want the ability to re-create an exact copy of any specific environment at any time, be it for test or business continuity purposes.
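The sketch below illustrates the underlying idea of declarative, repeatable deployments in a few lines. It is a toy model, not how any particular infrastructure-as-code tool works: the `plan_changes` and `apply_changes` functions and the sample resource names are invented for this example, and environments are represented as plain dictionaries.

```python
from typing import Dict, List, Tuple

Environment = Dict[str, dict]  # resource name -> resource settings


def plan_changes(desired: Environment, current: Environment) -> List[Tuple[str, str]]:
    """Compute the ordered actions needed to move `current` to `desired`."""
    actions = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_changes(desired: Environment, current: Environment) -> Environment:
    """Apply the plan to a copy of `current` and return the resulting state."""
    state = dict(current)
    for action, name in plan_changes(desired, current):
        if action == "delete":
            del state[name]
        else:  # "create" or "update"
            state[name] = desired[name]
    return state


# A hypothetical environment definition, kept in version control.
spec = {
    "web-servers": {"count": 3, "size": "small"},
    "database": {"engine": "postgres", "replicas": 2},
}

production = apply_changes(spec, {})
test_copy = apply_changes(spec, {})

print(production == test_copy)         # True: same definition, same environment
print(plan_changes(spec, production))  # []: re-applying changes nothing
```

Because the environment is derived entirely from the definition, the same input produces the same result every time, which is what makes an exact copy for testing or business continuity possible.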