With a grasp of what you are responsible for as an AWS customer, you can now turn to the pillars that will be tested in the exam. The first pillar is incident response (IR). Knowing how to prepare and then react, in both a manual and an automated fashion, when something occurs in one of your AWS accounts is necessary, not only for the exam but also in real life.
As you will see in this chapter, preparation is crucial to IR. This includes gathering the correct team members responsible for participating in any IR activities. Preparation also includes creating (and testing) runbooks and playbooks that give team members the exact set of instructions to follow and cut down on the response time in the event of an incident. Further, enabling the correct set of logs and visibility services, so that you and your team can construct monitoring mechanisms and alerts for abnormal activity, is part of the pre-incident process.
Unfortunately, it is not possible to stop all security events from arising. As technology changes, new vulnerabilities, threats, and risks are introduced. Combine that with human error, and incidents will undoubtedly occur. Because of these factors, there is a need to implement an IR policy and various associated processes.
The following main topics will be covered in this chapter:
You need an understanding of AWS and networking concepts, as well as access to the AWS Management Console and an active AWS account, to follow along with the step-by-step guides presented in this chapter.
The goals of IR can be broken down into short-term and long-term goals. Ultimately, you want to be in a position where you no longer have to engage in IR. A short-term goal for an organization may be to ensure that all the logging is in place and notification systems are enabled in case of an incident. Long-term goals may take the form of compiling scripted playbooks with detailed steps so that new team members can quickly and efficiently respond to an incident or, better yet, preparing automated responses. For instance, Systems Manager documents and Lambda functions that trigger automatically based on items found in logs mean that no person needs to respond. The response happens before anyone can even turn on their computer.
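To make the idea of an automated response concrete, the following is a minimal sketch of a Lambda function, not a definitive implementation. It assumes an EventBridge rule (not shown) forwards CloudTrail management events to the function, and it reacts to one illustrative scenario chosen for this example: someone stopping a trail with the StopLogging API call. The handler name, the returned dictionary, and the optional cloudtrail_client parameter (added so the logic can be exercised without AWS credentials) are assumptions made for illustration.

```python
def lambda_handler(event, context, cloudtrail_client=None):
    """Restart a CloudTrail trail when a StopLogging call is detected.

    Assumes `event` is an EventBridge event wrapping a CloudTrail record,
    so the API call details sit under the "detail" key.
    """
    detail = event.get("detail", {})

    # Only react to the one API call this sketch is meant to remediate.
    if detail.get("eventName") != "StopLogging":
        return {"action": "ignored", "reason": "not a StopLogging event"}

    # StopLogging takes the trail name or ARN in its "name" parameter.
    trail = (detail.get("requestParameters") or {}).get("name")
    if not trail:
        return {"action": "ignored", "reason": "no trail name in event"}

    if cloudtrail_client is None:
        # boto3 is imported lazily so the function can be tested locally
        # with a stand-in client.
        import boto3
        cloudtrail_client = boto3.client("cloudtrail")

    # Turn logging back on for the affected trail.
    cloudtrail_client.start_logging(Name=trail)
    return {"action": "restarted", "trail": trail}
```

In a real account, the function's execution role would need permission to call cloudtrail:StartLogging, and you would likely also send a notification so that the team can investigate who stopped the trail and why.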
It all begins with having a plan. A playbook with scripted steps that you or other team members can follow can relieve the stress of an event. An automated runbook or predefined templates (such as CloudFormation templates) can help you recover. Having such items already developed and tested can help shorten the duration of an event.
Figure 4.1: The incident response process as a continual loop
In Figure 4.1, you see the IR cycle and how it can lead to shorter incidents, which is the goal of any IR. Once detection identifies an abnormality, you can respond to it. After the incident has been contained, you can recover and bring your account back to a steady running state. Once things are running normally, you can take time to learn and improve from the previous incident, noting any deficiencies that were encountered or manual steps that could be automated in the future. You may also notice that the information you are gathering is insufficient and that you need to turn on additional logging, or that one or more of the instructions in the runbook must be corrected.
All of this can be corrected in the Prepare stage so that the IR team and systems are running as optimally as possible, and the cycle for the next incident becomes shorter.
As you will see later in this chapter, the AWS Systems Manager service has several tools that can help you with the operational side of IR.
Now that you understand the goals of IR, take a look at some of the best practices that AWS recommends in the AWS Well-Architected Framework (WAF).