This best practice now applies to the client side, or sending end, of the request.
When you depend on other components, set timeouts appropriate to your use case, since those components can become unhealthy and stop responding, as explained in the preceding section. Also, avoid relying on default values, since they may not fit your use case.
A good practice is to define timeouts for all external calls, whether they are made to a local or a remote system. The difficulty lies in finding the right value for the timeout: it shouldn't be so high that it becomes useless, nor so low that it triggers an excessive number of timeouts. In the latter case, the risk is generating unnecessary retries that add stress to the network and/or the server, which in turn reduces or slows down the flow of responses, causing even more timeouts.
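As an illustration, here is a minimal sketch in Python showing how explicit timeouts and bounded retries could be set on an AWS SDK client with boto3; the specific values are assumptions and should be tuned to your own use case:

```python
import boto3
from botocore.config import Config

# Explicit timeouts and bounded retries instead of the SDK defaults.
# The values below are illustrative only; tune them to your use case.
config = Config(
    connect_timeout=2,  # seconds to wait when establishing a connection
    read_timeout=5,     # seconds to wait for a response once connected
    retries={
        "max_attempts": 3,   # cap retries so failures don't pile stress on the server
        "mode": "adaptive",  # retry with backoff and client-side rate limiting
    },
)

# Any AWS SDK client can take this configuration.
dynamodb = boto3.client("dynamodb", config=config)
```

The adaptive retry mode applies backoff and client-side rate limiting, which helps mitigate the retry storms described above.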
When users or services interact with a workload, the series of interactions they perform is referred to as a session. A session contains user (or service) data that gets persisted between requests. An application is stateless if it does not require knowledge of prior interactions and does not need to store session information.
Stateless services bring major benefits in terms of scalability and reliability since incoming requests can be handled equally by any instance of the service. Plus, they become good candidates to be deployed on serverless compute platforms, such as AWS Lambda or AWS Fargate.
You may wonder how to handle state information in a stateless service. Well, instead of keeping it in memory or on a local disk, you offload it to another component of your architecture, such as a cache system (for instance, Amazon ElastiCache) or a database (for instance, Amazon DynamoDB).
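To make this concrete, here is a minimal sketch of offloading session state to DynamoDB with boto3; the table name, key, and attribute names are hypothetical, and the table is assumed to already exist:

```python
import time
import boto3

# Hypothetical DynamoDB table named "sessions", assumed to exist
# with "session_id" as its partition key.
table = boto3.resource("dynamodb").Table("sessions")

def save_session(session_id, data):
    """Persist session state outside the service instance."""
    table.put_item(
        Item={
            "session_id": session_id,
            "data": data,
            # Expiry attribute so stale sessions can be purged automatically,
            # assuming TTL is enabled on this attribute.
            "expires_at": int(time.time()) + 3600,
        }
    )

def load_session(session_id):
    """Retrieve session state; any instance of the service can do this."""
    response = table.get_item(Key={"session_id": session_id})
    return response.get("Item")
```

Because the state lives in DynamoDB rather than on a particular instance, any instance can serve any request, which is precisely what makes the service stateless.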
You can find more details on several of the techniques discussed in the preceding section in the Timeouts, retries, and backoff with jitter paper from the Amazon Builders' Library, referenced in the Further Reading section at the end of this chapter.
The next section will talk about why and how you should manage changes to your workload.
Managing changes to your workload is important since any change can potentially affect its resiliency. To ensure that your workload can handle changes without impact, you must anticipate them, but also monitor and control them. Note that all kinds of changes are considered here, whether they are made to your application code or they affect the environment where your workload operates (for instance, a surge in demand or a change of OS).
Monitoring is critical to ensure that you keep an eye on your workload's behavior at all times. Thankfully, this doesn't mean that someone needs to watch a screen 24/7.
Logs and metrics are powerful instruments that provide insight into the health of your workload. Firstly, you should make sure to monitor the logs and metrics emitted by your workload. Secondly, you should have notifications sent when thresholds are crossed or significant events affect your workload. Monitoring enables you to identify when your workload's SLAs are breached, when KPIs or other thresholds are crossed, or when failures occur.
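For example, one way to have notifications sent when a threshold is crossed is a CloudWatch alarm that publishes to an SNS topic. In this sketch, the namespace, metric name, threshold, and topic ARN are all placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: notify an SNS topic when the p99 latency of a
# custom metric stays above 500 ms for five consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="request-latency-p99-high",
    Namespace="MyWorkload",        # assumed custom namespace
    MetricName="RequestLatency",   # assumed custom metric
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=500,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```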
First things first, you must be able to instrument your workload to extract metrics, and then to evaluate thresholds, KPIs, or SLAs against your objectives. You should also record essential operational metrics, such as latency, request rates, error rates, and success rates. Absolute numbers or average values alone won't be sufficient to make an informed decision on the best course of action; depending on the event occurring, you will likely need to look at various aggregations, such as ratios, averages, and percentiles.
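To make the instrumentation side concrete, here is a minimal sketch that records per-request latency and outcome as a custom CloudWatch metric; the namespace, metric name, and dimension are assumptions:

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def handle_request():
    start = time.monotonic()
    try:
        ...  # actual request handling goes here
        outcome = "Success"
    except Exception:
        outcome = "Error"
        raise
    finally:
        latency_ms = (time.monotonic() - start) * 1000
        # Publish the raw measurement; CloudWatch can then derive rates,
        # averages, and percentiles (p50, p90, p99, and so on) from it.
        cloudwatch.put_metric_data(
            Namespace="MyWorkload",
            MetricData=[{
                "MetricName": "RequestLatency",
                "Dimensions": [{"Name": "Outcome", "Value": outcome}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            }],
        )
```

Publishing raw values rather than pre-aggregated numbers is what lets you examine percentiles later. In a real workload, you would typically batch these calls or emit metrics through logs (for example, the CloudWatch embedded metric format) rather than calling the API on every request.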
How do you go about monitoring on AWS then?
Monitoring on AWS consists of essentially four phases: generation, aggregation, real-time processing and alarming, and storage and analytics.
The following sections present each of these phases in more detail.