Features of AWS Glue

AWS Glue is a fully managed, serverless ETL (extract, transform, and load) service on AWS. It has the following features:

  • It automatically discovers and categorizes your data by connecting to the data sources and generates a data catalog.
  • Services such as Amazon Athena, Amazon Redshift, and Amazon EMR can use the data catalog to query the data.
  • AWS Glue generates ETL code in Python or Scala that runs on Apache Spark, and this code can be modified to suit your needs (a minimal job skeleton is sketched after this list).
  • It scales out automatically to match your Spark application requirements for running the ETL job and loading the data into the destination.
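
To make the point about generated ETL code concrete, the following is a minimal sketch of what a Glue job script looks like in Python (PySpark). The database and table names (s3-data, sales_records_csv) match the hands-on example later in this section, while the output path is a hypothetical placeholder:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions

    # Standard Glue job boilerplate: resolve job arguments and create the contexts
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table that a crawler has registered in the Data Catalog
    sales = glue_context.create_dynamic_frame.from_catalog(
        database="s3-data", table_name="sales_records_csv"
    )

    # Write the data to another S3 location in Parquet format (hypothetical path)
    glue_context.write_dynamic_frame.from_options(
        frame=sales,
        connection_type="s3",
        connection_options={"path": "s3://aws-glue-example-01/output-data/"},
        format="parquet",
    )

    job.commit()

The generated script can be edited in the Glue console or replaced entirely with your own Spark code.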

The Data Catalog is central to AWS Glue: it helps discover data in your data sources and capture metadata about it:

  • The Data Catalog automatically discovers new data and extracts schema definitions. It detects schema changes and maintains table versions, and it detects Apache Hive-style partitions on Amazon S3.
  • The Data Catalog comes with built-in classifiers for popular data types, and custom classifiers can be written using Grok expressions. The classifiers help to detect the schema (a boto3 sketch of registering a custom classifier follows this list).
  • Glue crawlers can be run ad hoc or in a scheduled fashion to update the metadata in the Glue Data Catalog. Glue crawlers must be associated with an IAM role with sufficient access to read the data sources, such as Amazon RDS, Redshift, and S3.
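
To illustrate custom classifiers, the following boto3 sketch registers a Grok classifier; the classifier name, classification, and pattern are hypothetical examples and are not part of the exercise below:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Register a custom Grok classifier (the name and pattern are hypothetical).
    # This pattern would match log lines such as "2024-01-31T10:15:00 some message".
    glue.create_classifier(
        GrokClassifier={
            "Name": "app-log-classifier",
            "Classification": "app_logs",
            "GrokPattern": "%{TIMESTAMP_ISO8601:event_time} %{GREEDYDATA:message}",
        }
    )

    # A crawler can then reference it via the Classifiers parameter of create_crawler().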

Now that you have a brief idea of what AWS Glue is used for, work through the following example to get your hands dirty.

Getting hands-on with AWS Glue Data Catalog components

In this example, you will create a job to copy data from S3 to Redshift using AWS Glue. All my components were created in the us-east-1 region. Start by creating a bucket (a boto3 sketch of these steps follows the list):

  1. Navigate to the AWS S3 console and create a bucket. I have named the bucket aws-glue-example-01.
  2. Click on Create Folder and name it input-data.
  3. Navigate inside the folder and click on the Upload button to upload the sales-records.csv dataset. The data is available in the following GitHub location: https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide-Second-Edition/tree/main/Chapter03/AWS-Glue-Demo/input-data.
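
If you prefer working programmatically, a boto3 equivalent of the bucket creation and upload is sketched here; bucket names are globally unique, so replace aws-glue-example-01 with your own name:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Create the bucket (no LocationConstraint is required in us-east-1)
    s3.create_bucket(Bucket="aws-glue-example-01")

    # Upload the sample dataset under the input-data/ prefix (the "folder")
    s3.upload_file(
        Filename="sales-records.csv",   # local path to the downloaded file
        Bucket="aws-glue-example-01",
        Key="input-data/sales-records.csv",
    )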

Now that the data is uploaded to the S3 bucket, prepare the default VPC in which your Redshift cluster will be created (a boto3 sketch of these steps follows the list).

  • Navigate to the VPC console by accessing the https://console.aws.amazon.com/vpc/home?region=us-east-1# URL and click on Endpoints on the left-hand side menu. Click on Create Endpoint and then fill in the fields as shown here:
    • Service category: AWS services
    • Select a service: com.amazonaws.us-east-1.s3 (the Gateway type)
    • VPC: Select the default VPC (your Redshift cluster will be created in this default VPC)
  • Leave the other fields as is and click on Create Endpoint.
  • Click on Security Groups in the VPC console and then click Create security group. Give your security group a name, such as redshift-self, select the default VPC from the VPC drop-down menu, and provide an appropriate description, such as Redshift Security Group. Click on Create security group.
  • Click on the Actions dropdown and select Edit Inbound rules. Click on Add rule and complete the fields as shown here:
    • Type: All traffic
    • Source: Custom
    • In the search field, select the same security group (redshift-self)
  • Click on Save Rules.
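
The same VPC preparation can be scripted with boto3, as sketched below; for simplicity, the first route table of the default VPC is used for the gateway endpoint, which is an assumption you may need to adjust:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Look up the default VPC and one of its route tables (needed for a gateway endpoint)
    vpc_id = ec2.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"][0]["VpcId"]
    route_table_id = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["RouteTables"][0]["RouteTableId"]

    # Gateway endpoint so that traffic to S3 stays inside the VPC
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=[route_table_id],
    )

    # Security group that allows all traffic from itself (self-referencing rule)
    sg = ec2.create_security_group(
        GroupName="redshift-self",
        Description="Redshift Security Group",
        VpcId=vpc_id,
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "-1",  # all traffic
            "UserIdGroupPairs": [{"GroupId": sg["GroupId"]}],
        }],
    )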

Now, create your Redshift cluster (a boto3 equivalent of the following console steps is sketched after them).

  1. Navigate to the Amazon Redshift console. Click on Create cluster and complete the highlighted fields, as shown in Figure 3.1:

Figure 3.1 – A screenshot of Amazon Redshift’s Create cluster screen

  2. Scroll down and fill in the highlighted fields shown in Figure 3.2 with your own values:

Figure 3.2 – A screenshot of an Amazon Redshift cluster’s Database configurations section

  3. Scroll down and change the Additional configurations settings, as shown in Figure 3.3:

Figure 3.3 – A screenshot of an Amazon Redshift cluster’s Additional configurations section

  4. Change the IAM permissions too, as shown in Figure 3.4:

Figure 3.4 – A screenshot of an Amazon Redshift cluster’s Cluster permissions section

  5. Scroll down and click on Create cluster. It will take a minute or two for the cluster to reach the Available state.
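
For reference, a boto3 equivalent of the cluster creation is sketched below; the node type, password, security group ID, and role ARN are placeholders that you should replace with the values you used in the console:

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    redshift.create_cluster(
        ClusterIdentifier="redshift-glue-example",
        NodeType="dc2.large",                  # assumed node type
        ClusterType="single-node",             # assumed cluster size
        DBName="glue-dev",
        MasterUsername="awsuser",
        MasterUserPassword="ChooseAStrongPassword1",   # placeholder password
        VpcSecurityGroupIds=["sg-0123456789abcdef0"],  # ID of the redshift-self group (placeholder)
        IamRoles=["arn:aws:iam::111122223333:role/Glue-IAM-Role"],  # placeholder role ARN
    )

    # Wait until the cluster is in the Available state before moving on
    redshift.get_waiter("cluster_available").wait(
        ClusterIdentifier="redshift-glue-example"
    )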

Next, you will create an IAM role.

  1. Navigate to the AWS IAM console and select Roles in the Access Management section on the screen.
  2. Click on the Create role button and choose Glue from the list of services. Click on the Next: Permissions button to navigate to the next page.
  3. Search for AmazonS3FullAccess and select it. Then, search for AWSGlueServiceRole and select it. As you are writing your data to Redshift as part of this example, select AmazonRedshiftFullAccess. Click on Next: Tags, followed by the Next: Review button.
  4. Provide a name, Glue-IAM-Role, and then click on the Create role button. The role appears as shown in Figure 3.5, and a boto3 sketch of the same role creation follows the figure:

Figure 3.5 – A screenshot of the IAM role
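
The same role can be created with boto3, as sketched below; the managed policy ARNs correspond to the policies selected in the console steps above:

    import json
    import boto3

    iam = boto3.client("iam")

    # Trust policy that lets the AWS Glue service assume the role
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(
        RoleName="Glue-IAM-Role",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Attach the managed policies chosen in the console steps
    for policy_arn in [
        "arn:aws:iam::aws:policy/AmazonS3FullAccess",
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
        "arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
    ]:
        iam.attach_role_policy(RoleName="Glue-IAM-Role", PolicyArn=policy_arn)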

Now you have the input data source and the output data store ready. The next step is to create the Glue connection and crawler from the AWS Glue console (a boto3 sketch of the database and crawler creation follows these steps).

  1. Select Connections under Databases. Click on the Add connection button and complete the fields as shown here:
    • Connection name: glue-redshift-connection
    • Connection type: Amazon Redshift
  2. Click on Next and then fill in the fields as shown here:
    • Cluster: redshift-glue-example
    • Database name: glue-dev
    • Username: awsuser
    • Password: ******** (enter the password you chose when creating the Redshift cluster)
  3. Click on Next and then Finish. To verify that it’s working, click on Test Connection, select Glue-IAM-Role in the IAM role section, and then click Test Connection.
  4. Go to Crawlers and select Add crawler. Provide a name for the crawler, S3-glue-crawler, and then click Next. On the Specify crawler source type page, leave everything at its default setting and then click Next.
  5. On the Add a data store page, in the Include path field, enter s3://aws-glue-example-01/input-data/sales-records.csv.
  6. Click Next.
  7. Set Add another data store to No. Click Next.
  8. For Choose an existing IAM Role, set Glue-IAM-Role. Then, click Next.
  9. Set Frequency to Run on demand. Click Next.
  10. No database has been created, so click on Add database, provide a name, s3-data, click Next, and then click Finish.
  11. Select the crawler, S3-glue-crawler, and then click on Run Crawler. Once the run is complete, you will see 1 in the Tables Added column. This means that the s3-data database created in the previous step now contains a new table. Click on Tables and select the newly created table, sales_records_csv. You can see that its schema has been discovered, and you can change a column's data type if the inferred type does not meet your requirements.
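
For completeness, here is a boto3 sketch of creating the database and crawler and running it on demand; after the crawler finishes, get_tables shows the discovered table and its inferred schema:

    import time
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Create the Data Catalog database and the crawler, then run it on demand
    glue.create_database(DatabaseInput={"Name": "s3-data"})
    glue.create_crawler(
        Name="S3-glue-crawler",
        Role="Glue-IAM-Role",
        DatabaseName="s3-data",
        Targets={"S3Targets": [
            {"Path": "s3://aws-glue-example-01/input-data/sales-records.csv"}
        ]},
    )
    glue.start_crawler(Name="S3-glue-crawler")

    # Wait for the on-demand run to finish before inspecting the results
    while glue.get_crawler(Name="S3-glue-crawler")["Crawler"]["State"] != "READY":
        time.sleep(15)

    # List the discovered table(s) and their column names
    for table in glue.get_tables(DatabaseName="s3-data")["TableList"]:
        print(table["Name"], [c["Name"] for c in table["StorageDescriptor"]["Columns"]])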

In this hands-on section, you learned about database tables, database connections, crawlers for data in S3, and the creation of a Redshift cluster. In the next hands-on section, you will learn about creating ETL jobs using Glue.