When you detect that the availability of your workload is impaired, you should scale the necessary resources to restore it. For that, it’s important that you can detect the health issue and be notified in the first place, for instance, using a canary test (refer to the Monitoring Workload Resources section for further details). Then, you should handle such conditions automatically and scale accordingly.
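For instance, if your canary is built with Amazon CloudWatch Synthetics, it publishes a SuccessPercent metric you can alarm on. The following sketch, using boto3, assumes a hypothetical canary named checkout-canary and a hypothetical SNS topic for notifications:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the SuccessPercent metric published by a CloudWatch Synthetics canary.
# The canary name and SNS topic ARN below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-canary-failing",
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",
    Dimensions=[{"Name": "CanaryName", "Value": "checkout-canary"}],
    Statistic="Average",
    Period=300,                      # evaluate over 5-minute windows
    EvaluationPeriods=2,             # two consecutive breaches before alarming
    Threshold=90.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",    # no data from the canary is itself a bad sign
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-notifications"],
)
```

The alarm action does not have to be a simple notification; it could just as well trigger an automated remediation or a scaling action.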
Ideally, you want to leverage the dynamic scaling mechanisms described earlier to scale your AWS resources proactively, before your workload even becomes impacted. Based on the most relevant metric for your use case (for instance, average CPU utilization), you decide how your resources should scale automatically whenever a specific threshold is breached, so that they keep satisfying demand. You could also leverage predictive scaling, or a combination of predictive and dynamic scaling.
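For example, with Amazon EC2 Auto Scaling, a target tracking policy keeps a metric such as average CPU utilization around a value you choose. Here is a minimal boto3 sketch, assuming a hypothetical Auto Scaling group named web-tier-asg and an illustrative 50% target:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking policy: the Auto Scaling group adds or removes instances
# to keep average CPU utilization around 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",          # hypothetical group name
    PolicyName="keep-cpu-around-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```

With such a policy in place, the group grows when average CPU utilization drifts above the target and shrinks again when it drops well below it.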
Testing is crucial. Perform load tests on your workload, and do so early enough. You’d hate to find out the day before a major feature launch that your workload is not ready to sustain the expected load. You can easily spin up a new test environment on AWS, manually if need be, but preferably using infrastructure as code and automated deployments. After all, that’s also an opportunity to verify that those shiny continuous integration (CI)/continuous deployment (CD) pipelines of yours are working properly. Then, stress test your workload using synthetic load and identify any breaking points.
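As a deliberately simplified illustration, the following sketch sends synthetic load against a hypothetical test endpoint and reports errors and average latency as concurrency ramps up. For serious stress tests you would rather rely on a dedicated load testing tool, but the principle is the same:

```python
import concurrent.futures
import time
import urllib.request

# Hypothetical endpoint of your test environment.
TARGET_URL = "https://test.example.com/health"

def hit_endpoint(_):
    """Send one request and return (HTTP status or None, latency in seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5) as response:
            return response.status, time.perf_counter() - start
    except Exception:
        return None, time.perf_counter() - start

# Ramp up concurrency and watch how error rate and latency evolve.
for workers in (10, 50, 100):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(hit_endpoint, range(workers * 10)))
    errors = sum(1 for status, _ in results if status != 200)
    avg_latency = sum(latency for _, latency in results) / len(results)
    print(f"{workers} workers: {errors} errors, average latency {avg_latency:.3f}s")
```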
On top of that, once your workload is running in production, you can run occasional stress tests, naturally during off-peak periods, to validate that it keeps behaving as expected.
Implementing changes requires enough discipline to ensure that the outcome remains fully under control. You want to make sure that the changes you introduce, whether in the application code or in the runtime environment, do not threaten the reliability of your workload.
This is about using standard operating procedures to achieve repeatable, predictable outcomes. Runbooks document a series of steps, whether automated or manual, to be performed. They include procedures such as how to patch or upgrade your workload environment, as sketched below.
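To make this concrete, here is a sketch of what such a runbook could capture for patching a single instance, expressed in the document schema used by AWS Systems Manager Automation (discussed next). The document name and the instance parameter are illustrative:

```python
import json
import boto3

ssm = boto3.client("ssm")

# Hypothetical runbook: install pending patches on one instance, then reboot it.
patch_runbook = {
    "schemaVersion": "0.3",
    "description": "Patch a single instance and reboot it.",
    "parameters": {
        "InstanceId": {"type": "String", "description": "Instance to patch"},
    },
    "mainSteps": [
        {
            "name": "installPatches",
            "action": "aws:runCommand",
            "inputs": {
                "DocumentName": "AWS-RunPatchBaseline",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {"Operation": ["Install"]},
            },
        },
        {
            "name": "rebootInstance",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "ec2",
                "Api": "RebootInstances",
                "InstanceIds": ["{{ InstanceId }}"],
            },
        },
    ],
}

# Register the runbook as an Automation document.
ssm.create_document(
    Name="PatchSingleInstance",
    DocumentType="Automation",
    DocumentFormat="JSON",
    Content=json.dumps(patch_runbook),
)
```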
As much as possible, automate the entire process. The less human interaction, the fewer errors. And don’t forget to document a rollback process in case something goes wrong along the way. That also should be automated as much as possible for the same reason.
AWS Systems Manager is here to assist you with the creation and management of your own runbooks. In particular, AWS Systems Manager Automation lets you automate common maintenance or deployment tasks for AWS services such as EC2, RDS, and S3. For instance, it offers several runbooks managed by AWS, with predefined steps that can be used to perform common tasks such as restarting or resizing EC2 instances, or creating an Amazon Machine Image (AMI). You can naturally create your own custom runbooks, leveraging any of the available predefined steps and augmenting them with your own scripts.
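For example, a single boto3 call is enough to kick off the AWS-managed AWS-RestartEC2Instance runbook against an instance (the instance ID below is a placeholder); a custom runbook registered with Systems Manager, like the patching sketch earlier, is executed the same way:

```python
import boto3

ssm = boto3.client("ssm")

# Start the AWS-managed runbook that stops and restarts an EC2 instance.
execution = ssm.start_automation_execution(
    DocumentName="AWS-RestartEC2Instance",
    Parameters={"InstanceId": ["i-0123456789abcdef0"]},  # placeholder instance ID
)

# The execution ID can be used to track progress or to audit the run later.
print(execution["AutomationExecutionId"])
```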