A system is said to be idempotent if a given request made multiple times leads to the same result as that same request made exactly once. Idempotency facilitates failure handling because, when a request fails, you can retry it without worrying about whether it ends up being executed once or several times. You can implement such a mechanism relatively easily using an idempotency token, a unique identifier that remains the same even if the request is repeated multiple times. The component receiving a request carrying that idempotency token must then make sure that repeated identical requests all get the same response.
That’s a powerful mechanism to avoid duplicate processing and duplicate data, for instance in the case of transactions.
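To make this more tangible, here is a minimal sketch in Python of what the receiving side could look like. The names (handle_request, process_payment) and the in-memory dictionary are purely illustrative assumptions; in a real system, the responses would be kept in a durable store shared by all instances of the component.

```python
# A minimal sketch of server-side idempotency handling (names are illustrative).
# Responses are cached by idempotency token so that retries of the same request
# return the stored result instead of being processed a second time.

import uuid

_responses_by_token: dict[str, dict] = {}  # in practice, a durable store such as a database

def process_payment(amount: float) -> dict:
    """Placeholder for the actual business logic (for example, charging a card)."""
    return {"transaction_id": str(uuid.uuid4()), "amount": amount, "status": "completed"}

def handle_request(idempotency_token: str, amount: float) -> dict:
    # If this token has been seen before, return the original response unchanged.
    if idempotency_token in _responses_by_token:
        return _responses_by_token[idempotency_token]
    # First time this token is seen: do the work once and remember the result.
    response = process_payment(amount)
    _responses_by_token[idempotency_token] = response
    return response

# The client generates one token per logical request and reuses it on every retry.
token = str(uuid.uuid4())
first = handle_request(token, 42.0)
retry = handle_request(token, 42.0)   # a retry of the same request after a failure
assert first == retry                 # the payment was processed exactly once
```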
This anti-fragility best practice consists of limiting the variance of the work done over time, independent of the state of the system. It is based on the premise that variance causes disruption: a system is more likely to fail when it is subject to rapid and significant variations in load. To better understand this idea, read the Reliability and Constant Work paper from the Amazon Builders’ Library, which is referenced in the Further Reading section at the end of this chapter. It explains very well how Amazon Route 53 and AWS Hyperplane (the network function virtualization platform underpinning several AWS services) leverage this technique internally to make sure that their critical components will not fail due to changes in load: they always do the same constant amount of work, day in, day out, no matter what.
For a more concrete example, similar to what Route 53 does internally, imagine that a critical component in your workload is in charge of performing health checks on a fleet of servers. The idea of constant work is to design that component and its supporting infrastructure so that they always do the same job and the same amount of work. For instance, suppose that after some testing you measure that your health check component can check 100 servers at a time. If your overall system consists of a fleet of 1,000 servers, you would simply deploy 10 instances of your health check component, each instance performing health checks on 100 servers. But what if your fleet is composed of a number of servers that is not a multiple of 100? If, for instance, you only have a fleet of 250 servers, you would need to deploy three instances of your health check component, each capable of performing health checks on 100 servers. But wait, you only have 250 servers, not 300; how do you split the health checks? At first, it would seem logical to split the 250 servers more or less equally across the three health check component instances.
However, the trick, according to the principle of doing constant work, is to assign 100 servers to the first two health check component instances and the remaining 50 servers to the third health check component instance. To make sure that all your health check component instances keep doing constant work, add dummy data for the 50 missing servers and keep doing health checks on 100 servers, whether they are real or not. The key here is that your health check component keeps performing the same job again and again so that it is not affected by a sudden variation in load.
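To illustrate the idea, here is a minimal sketch in Python of how such padding could be organized. The batch size of 100, the function names, and the dummy targets are illustrative assumptions, not a prescription; the point is simply that every instance always processes exactly the same number of targets per cycle.

```python
# A minimal sketch of the constant-work batching described above (names are illustrative).
# The fleet is padded with dummy targets so that every health check instance always
# processes exactly BATCH_SIZE targets per cycle, whatever the real fleet size is.

import math

BATCH_SIZE = 100  # measured capacity of one health check component instance

def check_health(target: str) -> bool:
    """Placeholder for the real health check (for example, an HTTP ping)."""
    return not target.startswith("dummy-")

def build_constant_work_batches(servers: list[str]) -> list[list[str]]:
    # 250 real servers -> 3 instances -> pad with 50 dummy targets.
    instance_count = math.ceil(len(servers) / BATCH_SIZE)
    padding = ["dummy-%d" % i for i in range(instance_count * BATCH_SIZE - len(servers))]
    padded = servers + padding
    return [padded[i * BATCH_SIZE:(i + 1) * BATCH_SIZE] for i in range(instance_count)]

def run_health_check_cycle(batch: list[str]) -> None:
    # Every instance performs exactly BATCH_SIZE checks per cycle; results for
    # dummy targets are simply discarded.
    for target in batch:
        check_health(target)

fleet = [f"server-{i}" for i in range(250)]
batches = build_constant_work_batches(fleet)
print([len(batch) for batch in batches])  # [100, 100, 100] -- constant work per instance
for batch in batches:
    run_health_check_cycle(batch)
```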
You just learned how to design interactions between components to prevent failures. However, keep in mind that everything eventually fails. So, despite the precautions taken, failures can still occur at any time, and you need to make sure that your workload can handle them.