Detecting incidents is one thing, but being able to respond to them in a timely manner is even more important. Assume that you have put in place the necessary mechanisms to detect and prioritize incidents. What is next?
Next, you want the ability to remediate these incidents. Incidents vary, however, along two dimensions: severity, from minor to major or critical, and complexity, from easy-to-fix incidents with a single cause to complex ones caused by multiple intertwined issues. From a priority perspective, you want to address the most critical issues first. How can you tackle all incidents in a timely manner? The solution is a combination of automation for straightforward issues and prescriptive guidance for the more complex ones.
First, automation is a must-have. It might not look necessary from the perspective of a single solution managed by a single team, but if you consider a security operations team working for a complex organization with tens or hundreds of projects and teams running on AWS in production, automation becomes critical. Incident remediation simply cannot scale if all actions need to be taken manually.
Some incidents will have a single cause and be easy to fix. Think of incidents that are triggered by a violation of your organization’s security principles codified as AWS Config rules. Imagine, for instance, that you prohibit the use of S3 buckets with public access, whether read or write. A violation of that rule is straightforward to fix and shouldn’t require any manual intervention. Once the AWS Config rule violation has been captured, it could trigger a remediation action that first notifies the bucket owner and then, if the owner takes no corrective action within a specific timeframe, automatically blocks all public access to the S3 bucket in question. This is just an example and should not be followed to the letter as the best remediation course in such a case. The remediation action that you should take depends on your organization’s culture of risk and compliance management, as well as on the type of incident and the specific context in which it occurs. If you transpose the same example to a financial or healthcare institution where the S3 bucket happens to contain sensitive information (such as PII), a different course of action will apply, such as immediately blocking all public access and then notifying the bucket owner. The bottom line is that you need to adapt the remediation to your own context.
In any case, how can you achieve an effective response strategy on AWS?
First, consider a situation where an incident is caused by a configuration policy violation. Suppose you have some AWS Config rules activated on your accounts. When one of these rules is violated, AWS Config can trigger a remediation action using AWS Systems Manager Automation runbooks. This remediation action can be triggered manually or automatically, depending on your requirements. These runbooks specify the actions to be performed on non-compliant AWS resources. AWS Config comes with a set of predefined managed automation runbooks with remediation actions. You can also create your own custom automation runbooks and associate them with AWS Config rules. For the example used here, there is a predefined runbook called AWS-DisableS3BucketPublicReadWrite that can be used directly to fix the issue. This automation runbook expects two parameters: the S3 bucket name (S3BucketName) and the IAM role that the runbook should assume to execute the remediation action (AutomationAssumeRole). The runbook will use these parameters (in this case, provided by the AWS Config rule) to execute a call to the Amazon S3 PutPublicAccessBlock API.
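As an illustration of how such a rule-to-runbook association could be set up programmatically, the following sketch uses the AWS SDK for Python (boto3) and the AWS Config PutRemediationConfigurations API. Treat it as a sketch under assumptions rather than a reference setup: the rule name, the role ARN, and the retry settings are placeholders you would replace, and boto3 with valid credentials is assumed to be available when apply_remediation is called.

```python
from typing import Any, Dict

RUNBOOK = "AWS-DisableS3BucketPublicReadWrite"


def build_remediation_configuration(
    rule_name: str, role_arn: str, automatic: bool = True
) -> Dict[str, Any]:
    """Build one remediation configuration linking a Config rule to the runbook."""
    config: Dict[str, Any] = {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": RUNBOOK,
        "Parameters": {
            # Role the runbook assumes to perform the remediation.
            "AutomationAssumeRole": {"StaticValue": {"Values": [role_arn]}},
            # RESOURCE_ID tells AWS Config to pass the non-compliant
            # resource's ID (here, the bucket name) at remediation time.
            "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
        "Automatic": automatic,
    }
    if automatic:
        # Assumed retry settings; tune them to your own requirements.
        config["MaximumAutomaticAttempts"] = 3
        config["RetryAttemptSeconds"] = 60
    return config


def apply_remediation(rule_name: str, role_arn: str, automatic: bool = True) -> Dict[str, Any]:
    """Register the remediation with AWS Config (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("config")
    return client.put_remediation_configurations(
        RemediationConfigurations=[
            build_remediation_configuration(rule_name, role_arn, automatic)
        ]
    )
```

A call such as apply_remediation("s3-bucket-public-read-prohibited", "arn:aws:iam::123456789012:role/ExampleRemediationRole") would attach the runbook to a rule of that name; both values are hypothetical. Passing automatic=False leaves the remediation to be started manually.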
Now, consider a different situation where an incident is caused by a security policy violation. Suppose you have either Amazon GuardDuty or AWS Security Hub set up within your organization. When they report a finding, you can trigger an event based on that finding, using Amazon CloudWatch Events or Amazon EventBridge. Amazon EventBridge extends Amazon CloudWatch Events: it builds upon the same API and adds integration with third-party solutions (as event sources or destinations), event replay, and a schema registry. You can then decide what to do with the event, for instance, push it to an event bus in a different account or Region, send it to a Lambda function, or call an API destination to forward it to a third-party solution. The routing decision depends entirely on how you intend to handle the remediation.
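The routing described above can be sketched with boto3 as follows. This is a minimal example under assumptions: it matches GuardDuty findings only, the severity threshold of 7 (where GuardDuty's High range starts) is an arbitrary choice, and the rule name, target ID, and Lambda ARN are hypothetical. It also assumes the Lambda function already grants EventBridge permission to invoke it, which is not shown here.

```python
import json
from typing import Any, Dict


def guardduty_finding_pattern(min_severity: float = 7.0) -> Dict[str, Any]:
    """EventBridge event pattern matching GuardDuty findings at or above a severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        # Numeric matching on the finding's severity field.
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def route_findings_to_lambda(
    rule_name: str, lambda_arn: str, min_severity: float = 7.0
) -> None:
    """Create an EventBridge rule and send matching findings to a Lambda function.

    Requires boto3 and AWS credentials; names and ARNs are supplied by the caller.
    """
    import boto3  # imported lazily so the pattern builder works without the SDK

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(guardduty_finding_pattern(min_severity)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        # "remediation-lambda" is a hypothetical target identifier.
        Targets=[{"Id": "remediation-lambda", "Arn": lambda_arn}],
    )
```

For AWS Security Hub, the same approach applies with a pattern whose source is aws.securityhub and whose detail-type is "Security Hub Findings - Imported". Swapping the Lambda target for an event bus in another account or for an API destination changes only the entry passed to put_targets.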
In some cases, it will not be possible to trigger remediation immediately. These are usually complex incidents that require the security operations team to investigate the issue further and decide on the best remediation course. The team should have all the necessary tools at their disposal to act in a timely manner and take the most efficient course of action. This means that you need to put in place an incident management plan for your solution that covers the most likely scenarios. The incident management plan should document the communication and escalation paths to follow in such cases. It should also document the best response to each scenario, providing clear step-by-step instructions for the operations team to follow, including code snippets and command lines whenever needed.