Automating Recovery for Components Constrained to a Single Location In some cases, it will not be possible to run components of the workload across multiple AZs. For example, all nodes of an Amazon EMR cluster are launched in the same AZ to improve job performance thanks to reduced latency and thus a higher data access […]
Using Fault Isolation to Protect Your Data Fault isolation confines the impact of a failure to a limited set of a workload's components. The effect is similar to the blast radius limitation that you aim for in security. The idea is always the same: the components located outside of […]
Integrate Functional Testing as Part of Your Deployment This is a standard best practice. Functional tests should be part of your CI/CD pipeline(s). Failing any of those tests should stop the pipeline(s) from deploying any further, and trigger a rollback as needed. Integrate Resiliency Testing as Part of Your Deployment This is a more advanced […]
Obtaining Resources upon Detection of Impairment When you detect that the availability of your workload is impaired, you should scale the necessary resources to make it available again. For that, it’s important that you can detect the health issue and be notified in the first place, for instance, using a canary test. Refer to the […]
Monitoring End-to-End Tracing of Requests through Your System This was briefly touched upon under canary testing. It’s good practice to validate that end-to-end requests perform as expected. Leverage AWS X-Ray, or third-party equivalent tools, to help you understand how your workload and its underlying components are performing. Tracing can also prove particularly useful for debugging, […]
Aggregation – Defining and Calculating Metrics As already mentioned, several AWS services provide service-specific metrics in CloudWatch out of the box. For others, for instance, VPC Flow Logs, you’ll have to define metrics yourself by extracting data directly from the logs in CloudWatch. You do that by creating a metric filter that will look for […]
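To make the idea of deriving a metric from raw log lines concrete, here is a minimal local sketch in Python. It does not call CloudWatch and is not the metric filter syntax itself. It only imitates what such a filter does: parse each record and count the ones that match a condition. The field list follows the default VPC Flow Logs record format; the helper names `parse_record` and `count_matching` and the sample records are hypothetical, chosen for illustration.

```python
# Field names of the default (version 2) VPC Flow Logs record, in order.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line):
    """Split a space-delimited flow log record into a dict, or None if malformed."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def count_matching(lines, **criteria):
    """Count records whose fields equal every given criterion (hypothetical helper)."""
    count = 0
    for line in lines:
        record = parse_record(line)
        if record is not None and all(
            record.get(name) == value for name, value in criteria.items()
        ):
            count += 1
    return count


# Illustrative sample records (made up for this sketch).
SAMPLE = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "not a flow log record",
]

# The resulting count is the kind of value a metric filter would publish.
rejected = count_matching(SAMPLE, action="REJECT")
```

In a real setup the matching and counting are done by the metric filter attached to the log group, and the resulting number is published as a CloudWatch metric that alarms can watch.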
Generation – Monitoring All Components of Your Workload This may sound obvious, but it is essential to monitor all the components of your workload without exception, using either Amazon CloudWatch or third-party solutions if you prefer. From the frontend to the backend and the storage or database layer, you should make sure to collect the […]
Setting Client Timeouts This best practice applies to the client side, or sending end, of the request. When you depend on other components, set timeouts appropriate to your use case, since those components can become unhealthy and stop responding, as explained in the preceding section. Also, avoid relying on default values since they may […]
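As a rough illustration of explicit client timeouts, the Python sketch below sets separate connect and read timeouts on a plain TCP socket instead of relying on defaults. The function name `fetch_with_timeout` and the timeout values are assumptions made for this example; appropriate values depend on your use case.

```python
import socket


def fetch_with_timeout(host, port, payload, connect_timeout=1.0, read_timeout=2.0):
    """Send payload and return the first response chunk, or None on timeout.

    Hypothetical helper: both timeouts are set explicitly rather than
    relying on the library default, which is to block indefinitely.
    """
    try:
        # Bound the time spent establishing the connection.
        with socket.create_connection((host, port), timeout=connect_timeout) as conn:
            # Bound the time spent waiting for the dependency to answer.
            conn.settimeout(read_timeout)
            conn.sendall(payload)
            return conn.recv(4096)
    except socket.timeout:
        # The dependency did not respond in time; the caller decides
        # whether to retry, fall back, or fail fast.
        return None
```

Returning `None` on timeout is only one possible design. Raising a dedicated exception is equally reasonable, as long as the caller is never left waiting on an unresponsive component.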
Throttling Requests Throttling is a useful mechanism to respond to an unexpected burst in demand that exceeds a component’s capacity. Some of the requests are still served but those over a specific threshold are rejected with a return message that indicates they have been throttled. Why is it important to mention the reason for the […]
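One common way to implement this kind of threshold is a token bucket. The Python sketch below is a minimal, single-process illustration and not a description of how any particular AWS service throttles. The names `TokenBucket` and `handle_request` are hypothetical, and using HTTP status 429 (Too Many Requests) for the rejection message is an assumption about the protocol in use.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the behavior can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Consume one token if available; return False when the caller must be throttled."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_request(bucket):
    """Serve the request, or reject it with a message that names throttling as the cause."""
    if bucket.allow():
        return 200, "OK"
    return 429, "Too Many Requests: throttled, retry later"
```

Requests within the threshold are still served, while the excess is rejected with an explicit reason, so a well-behaved client can tell throttling apart from a genuine failure and back off before retrying.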
Designing Interactions in a Distributed System to Mitigate or Withstand Failures First, every component in your workload must behave in a way that does not negatively impact other components. Second, every component in your workload must be able to withstand the failure of one or more other components. Now, how can you achieve this? The […]