Throttling is a useful mechanism for responding to an unexpected burst in demand that exceeds a component's capacity. Requests under a specific threshold are still served, but those over it are rejected with a return message indicating that they have been throttled. Why is it important to state the reason for the rejection? Because you expect the client to take it into account: drop the request, back off, and perhaps try again after a while at a slower rate.
After designing your workload, you should perform stress tests to determine the request rate each component can handle. You can then use those metrics to define the thresholds at which to throttle incoming requests. Amazon API Gateway, for example, lets you set rate and burst limits for exactly this purpose.
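Rate-and-burst throttling of this kind is commonly built on a token-bucket algorithm. The following is a minimal sketch of that idea (the class name and parameters are illustrative, not any particular service's API): tokens refill at a steady rate up to a burst capacity, and a request is rejected when no token is available.

```python
import time

class TokenBucket:
    """Token-bucket throttle: allows short bursts up to `burst` requests
    while sustaining at most `rate` requests per second."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate                  # tokens added per second (steady-state limit)
        self.burst = burst                # bucket capacity (maximum burst size)
        self.tokens = float(burst)        # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                   # request admitted
        return False                      # over the threshold: reject, e.g. HTTP 429

# A burst of three requests against a bucket that allows a burst of two:
tb = TokenBucket(rate=10.0, burst=2)
decisions = [tb.allow(), tb.allow(), tb.allow()]
print(decisions)
```

Note that the rejection (here, a `False` that a server would translate into an HTTP 429 response) is what tells the client it has been throttled rather than having hit an error.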
That said, if you can address requests asynchronously, Amazon SQS and Amazon Kinesis can also be used to buffer requests, so you can handle requests at your own pace, avoiding the need for throttling.
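To illustrate the buffering idea without any AWS dependency, the sketch below uses an in-memory queue as a stand-in for a service such as Amazon SQS: producers enqueue a burst of requests as fast as they arrive, while a single worker drains them at its own pace, so no request has to be throttled.

```python
import queue
import threading
import time

# In-memory stand-in for a buffer such as an SQS queue.
buffer = queue.Queue()
processed = []

def worker():
    while True:
        request = buffer.get()
        if request is None:          # sentinel value: stop the worker
            break
        time.sleep(0.01)             # simulate slow, paced processing
        processed.append(request)
        buffer.task_done()

t = threading.Thread(target=worker)
t.start()

# A burst arrives far faster than the worker can serve it;
# the queue absorbs the burst instead of rejecting callers.
for i in range(5):
    buffer.put(f"request-{i}")

buffer.join()                        # wait until the backlog is drained
buffer.put(None)
t.join()
print(processed)                     # all five requests served, in arrival order
```

The trade-off, discussed later in this section, is that the buffer itself must be kept from growing without bound.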
What should you do when an error occurs? How should your components behave when they receive errors in response to requests they've made to their dependencies?
The temptation is high, especially when your application is time-constrained, to retry a failed request immediately and to keep retrying several times before dropping it. There could be many reasons for the failure, and retrying immediately and repeatedly adds more stress to the network, and to the server if the requests ever reach it. This behavior is then amplified by the number of clients retrying at exactly the same time.
A better approach is to use exponential backoff, retrying after progressively longer intervals. To make sure that retries from all clients do not occur at the same time, it is also recommended to introduce some jitter, which effectively randomizes the retry intervals. Finally, don't forget to limit the number of retries: if a request doesn't go through after a series of retries, it may be a sign of a deeper issue, such as incomplete or corrupt data.
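Put together, capped exponential backoff, "full" jitter, and a retry limit can be sketched as follows (the function and parameter names are illustrative):

```python
import random
import time

def retry_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=5.0):
    """Retry `operation` with capped exponential backoff and full jitter:
    each wait is drawn uniformly from [0, min(max_delay, base_delay * 2**attempt)],
    so clients that failed together do not all retry together."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise                # retry budget exhausted: surface the error
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)

# Example: an operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry_with_backoff(flaky, base_delay=0.001)
print(result, calls["n"])            # succeeds on the third attempt
```

Drawing the whole delay from a uniform range, rather than adding a small random offset to a fixed exponential delay, is what spreads simultaneous clients most evenly over time.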
Exponential backoff with retries is a technique implemented, for instance, by the AWS SDKs. That said, whenever you use third-party libraries or SDKs, always verify that retries behave as expected.
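In boto3, the AWS SDK for Python, for example, the retry behavior can be made explicit rather than left at its defaults, which makes it easier to verify:

```python
import boto3
from botocore.config import Config

# Ask the SDK for up to 5 total attempts using its "standard" retry mode,
# which applies exponential backoff with jitter on retryable errors.
retry_config = Config(retries={"max_attempts": 5, "mode": "standard"})
s3 = boto3.client("s3", config=retry_config)
```

Pinning these values in code documents your retry policy and protects you from changes in SDK defaults.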
This best practice applies to the server side or receiving end of the request.
For any component receiving a request, it is essential to determine rapidly whether or not it can handle that request. If for some reason it is unable to process it, for instance due to a lack of available resources, it should fail fast. The idea is to free resources quickly, taking some stress off the component and allowing it to recover. As mentioned earlier, one technique for relieving some of the pressure of incoming requests is to buffer them, in a queue, for instance.

That said, it is recommended not to let the queue grow too large, since a deep queue increases wait time on the client side and, in the worst case, means processing requests that the client has already dropped (stale requests). It can even lead to a state where the server tries to catch up with queued, but now stale, requests while fresh requests keep piling up behind them and become stale in turn, preventing the server from ever fully recovering. So, internal queue depths should be limited and kept under control.
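Both ideas, failing fast and bounding the queue depth, can be combined in a few lines. The sketch below (names and the depth limit are illustrative) uses a bounded queue and rejects new work immediately when it is full, instead of letting the backlog and client wait times grow:

```python
import queue

# A bounded internal queue: reject new work immediately when it is full,
# rather than letting the backlog (and client wait times) grow unbounded.
MAX_QUEUE_DEPTH = 3                  # illustrative limit; derive yours from load tests
backlog = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

def accept(request) -> bool:
    """Enqueue the request, or fail fast if the component is saturated."""
    try:
        backlog.put_nowait(request)  # non-blocking: never makes the caller wait
        return True                  # accepted: will be processed later
    except queue.Full:
        return False                 # rejected at once, e.g. HTTP 503

# Five requests arrive while nothing is being drained:
results = [accept(f"request-{i}") for i in range(5)]
print(results)                       # first three accepted, last two rejected
```

The immediate rejection gives the client a clear signal to back off and retry later, closing the loop with the client-side behavior described earlier in this section.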