Querying S3 data using Athena – AWS Services for Data Migration and Processing – MLS-C01 Study Guide

Querying S3 data using Athena

Athena is a serverless service for querying data stored in S3. It is serverless because you do not provision or manage the servers that run the queries:

  • Athena uses a schema to present the results of a query on the data stored in S3. You define the structure you want your data to have as a schema (table names, column names, and data types), and Athena reads the raw data from S3 and returns results according to that schema.
  • The output can be used by other services for visualization, storage, or further analytics. The source data in S3 can be structured, semi-structured, or unstructured, in formats such as JSON, CSV/TSV, Avro, Parquet, or ORC (among others). CloudTrail logs, ELB logs, and VPC Flow Logs can also be stored in S3 and analyzed by Athena.
  • This follows the schema-on-read technique. Unlike the traditional schema-on-write approach, where data is validated and transformed to fit a schema as it is loaded, the source data stays in S3 unchanged. Tables are defined in advance in a data catalog, and the data’s structure is checked against the table’s schema only when a query reads it. SQL queries can therefore be run on the data without transforming the source data.
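To make the idea of defining a schema over raw files more concrete, here is a minimal sketch in Python that builds the kind of CREATE EXTERNAL TABLE statement Athena accepts for delimited text files. The helper name, database, table, columns, and bucket path are all hypothetical and chosen only for illustration. The statement assumes CSV files with a header row.

```python
def build_create_table_ddl(database, table, columns, s3_location, delimiter=","):
    """Return a CREATE EXTERNAL TABLE statement for delimited text files in S3.

    The statement only registers a schema in the data catalog; the files
    under s3_location are left untouched (schema-on-read).
    """
    column_sql = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    )


# Hypothetical names: example_db, orders, and example-bucket are placeholders.
ddl = build_create_table_ddl(
    database="example_db",
    table="orders",
    columns=[("order_id", "string"), ("amount", "double")],
    s3_location="s3://example-bucket/orders/",
)
print(ddl)
```

Running the printed statement in the Athena query editor would register the table without copying or modifying the files. Changing the schema later means changing the table definition only, not the data.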

To make this concrete, here is an example in which you use AwsDataCatalog, the data catalog created in AWS Glue for the S3 data, and query that data using Athena:

  1. Navigate to the Amazon Athena console. Select AwsDataCatalog from Data source (if you are doing this for the first time, a sampledb database containing a table named elb_logs will be created in the AWS Glue Data Catalog).
  2. Select s3-data as the database.
  3. Click on Settings in the top-right corner and fill in the details, such as the S3 location where query results are saved, as shown in Figure 3.8 (I have used the same bucket as in the previous example, with a different folder):

Figure 3.8 – A screenshot of Amazon Athena’s settings

  4. Write your query in the query editor and execute it. Once you have finished, delete your S3 buckets and AWS Glue Data Catalog resources so that you are not charged for resources you no longer need.
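The console steps above can also be sketched programmatically. The following is a minimal sketch, assuming the boto3 library is installed and AWS credentials are configured. The table name my_table and the bucket example-bucket are hypothetical placeholders, and the output location plays the role of the query result location set in the console settings.

```python
import time


def build_query_request(sql, database, output_location, catalog="AwsDataCatalog"):
    """Assemble keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, poll_seconds=1.0):
    """Submit the query, wait until it finishes, and return the first page of rows."""
    import boto3  # imported here so the request builder works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended as {status['State']}: {reason}")

    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


# Hypothetical table and bucket names; s3-data is the database selected in step 2.
request = build_query_request(
    sql='SELECT * FROM "my_table" LIMIT 10',
    database="s3-data",
    output_location="s3://example-bucket/athena-results/",
)
# rows = run_query(request)  # uncomment to run against a real AWS account
```

Building the request separately from submitting it keeps the settings (catalog, database, and result location) easy to inspect before anything is sent to AWS. Note that run_query returns only the first page of results; larger result sets would need pagination.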

In this section, you learned how to query S3 data using Amazon Athena through the AWS Glue Data Catalog. You also learned how to create a schema and query data from S3. In the next section, you will learn about Amazon Kinesis Data Streams.