As already mentioned, several AWS services publish service-specific metrics to CloudWatch out of the box. For others, such as VPC Flow Logs, you’ll have to define metrics yourself by extracting data directly from the logs stored in CloudWatch Logs. You do that by creating a metric filter, which looks for the pattern you specify in the log data and turns matching log events into metric data points.
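As a minimal sketch of such a metric filter, the Python snippet below builds the parameters for the CloudWatch Logs put_metric_filter call and counts rejected connections in VPC Flow Logs. It assumes the flow logs use the default record format; the filter name, metric name, and namespace are hypothetical and chosen only for illustration.

```python
# Field names of the default VPC Flow Logs record format, in order.
# The quoted condition on "action" keeps only rejected traffic.
FLOW_LOG_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", 'action="REJECT"', "log_status",
]


def build_rejected_traffic_filter(log_group_name: str) -> dict:
    """Build the keyword arguments for logs.put_metric_filter."""
    return {
        "logGroupName": log_group_name,
        "filterName": "RejectedFlows",               # hypothetical name
        "filterPattern": "[" + ", ".join(FLOW_LOG_FIELDS) + "]",
        "metricTransformations": [{
            "metricName": "RejectedFlowCount",       # hypothetical name
            "metricNamespace": "Custom/VPCFlowLogs",  # hypothetical namespace
            "metricValue": "1",    # each matching log event counts as 1
            "defaultValue": 0.0,   # reported when nothing matches
        }],
    }


def create_rejected_traffic_filter(log_group_name: str) -> None:
    """Create the filter; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above stays dependency-free
    logs = boto3.client("logs")
    logs.put_metric_filter(**build_rejected_traffic_filter(log_group_name))
```

Keeping the parameter construction separate from the API call makes the pattern easy to inspect and test before anything is created in your account.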
You may also be interested in defining your own custom metrics to compute your KPIs and track your SLAs. You can publish custom metrics to CloudWatch and attach dimensions of your choosing to them, so that the same metric can be broken down, for instance, by environment or by service. You can also use math expressions to combine data from multiple CloudWatch metrics into a new time series.
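To make this more concrete, here is a hedged sketch in Python of both ideas: the parameters for publishing one custom data point with put_metric_data, and a set of queries for get_metric_data that uses a math expression to derive an error rate from two metrics. The namespaces, metric names, and the Environment dimension are assumptions made up for this example.

```python
def build_custom_metric(order_count: float, environment: str) -> dict:
    """Build the keyword arguments for cloudwatch.put_metric_data."""
    return {
        "Namespace": "Custom/Orders",           # hypothetical namespace
        "MetricData": [{
            "MetricName": "OrdersProcessed",     # hypothetical metric
            "Dimensions": [{"Name": "Environment", "Value": environment}],
            "Value": order_count,
            "Unit": "Count",
        }],
    }


def build_error_rate_queries(namespace: str, period: int = 300) -> list:
    """Build MetricDataQueries combining two metrics with metric math."""
    def metric_query(query_id: str, metric_name: str) -> dict:
        return {
            "Id": query_id,   # ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": metric_name},
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,   # only the computed series is returned
        }

    return [
        metric_query("errors", "Errors"),       # hypothetical metric names
        metric_query("requests", "Requests"),
        {
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]


def publish_order_count(order_count: float, environment: str) -> None:
    """Publish the data point; requires boto3 and AWS credentials."""
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(**build_custom_metric(order_count, environment))
```

The list returned by build_error_rate_queries would be passed as MetricDataQueries to get_metric_data, together with a start and an end time.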
Once your workload metrics are defined, you can set alarms to be triggered when a metric crosses a threshold. When this happens, you can notify any team that should be aware of the event. For that, you can use Amazon SNS, which can publish a notification message to multiple destinations, such as HTTP endpoints, AWS Lambda, Amazon SQS, Amazon Kinesis Data Firehose, AWS Chatbot (for delivery to a Slack channel, for instance), PagerDuty, email, SMS, or mobile push notification.
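The following Python sketch shows what such an alarm could look like when defined through the CloudWatch put_metric_alarm call, with an SNS topic as the alarm action. The alarm name, metric, threshold, and evaluation settings are hypothetical values picked for illustration, and the topic ARN is expected to point at a topic you have already created.

```python
def build_alarm(topic_arn: str, threshold: float = 100.0) -> dict:
    """Build the keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": "RejectedFlowsHigh",          # hypothetical name
        "AlarmDescription": "Too many rejected connections",
        "Namespace": "Custom/VPCFlowLogs",          # hypothetical namespace
        "MetricName": "RejectedFlowCount",          # hypothetical metric
        "Statistic": "Sum",
        "Period": 300,             # evaluate the metric in 5-minute periods
        "EvaluationPeriods": 3,    # breach must last 3 consecutive periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],   # SNS topic notified on ALARM state
    }


def create_alarm(topic_arn: str) -> None:
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_alarm(topic_arn))
```

Every subscriber of the SNS topic, whether a person or a system, then receives the notification when the alarm enters the ALARM state.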
Once your event notification is in place, you can use automation on the other end to process the events received. One type of automation that comes immediately to mind is remediation, which fixes the issue behind an event. That’s one possibility, but not the only one. If the issue cannot be easily remediated, you could, for instance, automatically open a ticket with your organization’s ticketing system so that the issue is routed and prioritized accordingly.
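As an illustration of this kind of automation, here is a minimal sketch of an AWS Lambda handler in Python, subscribed to the SNS topic, that tries a known remediation first and falls back to opening a ticket. The remediation routine, the ticketing function, and the alarm name are all hypothetical placeholders; a real implementation would call your own tooling at those points.

```python
import json


def restart_service(alarm: dict) -> str:
    # Placeholder remediation (hypothetical); real code would act on the workload.
    return "remediated: " + alarm["AlarmName"]


def open_ticket(alarm: dict) -> str:
    # Placeholder for a call to your ticketing system (hypothetical).
    return "ticket opened: " + alarm["AlarmName"]


# Alarms for which an automated fix is known (hypothetical mapping).
REMEDIATIONS = {"HighErrorRate": restart_service}


def handler(event: dict, context=None) -> list:
    """Process CloudWatch alarm notifications delivered through SNS."""
    outcomes = []
    for record in event.get("Records", []):
        # SNS delivers the alarm details as a JSON string in "Message".
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") != "ALARM":
            continue   # ignore transitions back to OK or INSUFFICIENT_DATA
        action = REMEDIATIONS.get(alarm["AlarmName"], open_ticket)
        outcomes.append(action(alarm))
    return outcomes
```

Falling back to a ticket by default means that an alarm nobody anticipated still reaches a human instead of being silently dropped.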
So far, you’ve collected logs and set up metrics, defined thresholds and created alarms, and set up notifications to the various teams and systems. So, you should be good to go, right?
Well, not quite yet. An often-overlooked part of the monitoring process is data management. What are your data retention needs for monitoring data, including logs?
You can use CloudWatch Logs for data retention and leverage Amazon CloudWatch Logs Insights to run queries on your log data. However, you might be interested in transferring your logs to your organization’s existing log management system, such as Splunk or Logstash. Alternatively, you may want to leverage AWS analytics services, such as Amazon Athena or Amazon EMR, to run analytics on your log data using SQL queries or Spark jobs. To do that, you first need to instruct CloudWatch to transfer your logs to Amazon S3. Once your logs are on S3, you have complete freedom to use any of the AWS analytics services or your own third-party analytics tools. Another benefit is that you can then leverage the object lifecycle management capability of S3 to optimize the associated storage costs, progressively transitioning your logs through the various S3 storage tiers down to Amazon S3 Glacier for long-term archival and retention.
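The sketch below, in Python, illustrates one way to do this: it builds the parameters for a CloudWatch Logs create_export_task call that copies a time range of logs to S3, and a lifecycle configuration for put_bucket_lifecycle_configuration that moves the exported logs to colder storage over time. The bucket, prefix, task name, and the 30, 90, and 365-day thresholds are assumptions for illustration only; your own retention requirements should drive the actual values.

```python
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """CloudWatch Logs expects timestamps in milliseconds since the epoch."""
    return int(moment.timestamp() * 1000)


def build_export_task(log_group: str, bucket: str,
                      start: datetime, end: datetime) -> dict:
    """Build the keyword arguments for logs.create_export_task."""
    return {
        "taskName": "export-" + start.strftime("%Y-%m-%d"),  # hypothetical
        "logGroupName": log_group,
        "fromTime": to_epoch_ms(start),
        "to": to_epoch_ms(end),
        "destination": bucket,             # name of the target S3 bucket
        "destinationPrefix": "cloudwatch-logs",   # hypothetical prefix
    }


def build_lifecycle_configuration(prefix: str = "cloudwatch-logs/") -> dict:
    """Build the LifecycleConfiguration for s3.put_bucket_lifecycle_configuration."""
    return {
        "Rules": [{
            "ID": "archive-exported-logs",   # hypothetical rule name
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},   # delete after one year (assumption)
        }],
    }


def export_and_archive(log_group: str, bucket: str,
                       start: datetime, end: datetime) -> None:
    """Run both calls; requires boto3, AWS credentials, and bucket permissions."""
    import boto3
    boto3.client("logs").create_export_task(
        **build_export_task(log_group, bucket, start, end))
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_configuration())
```

Note that the target bucket needs a policy that allows CloudWatch Logs to write to it before an export task can succeed.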