Getting hands-on with AWS Glue ETL components
In this section, you will use the Data Catalog components created earlier to build an ETL job. Start by creating the job:
Navigate to the AWS Glue console and click on Jobs under the ETL section.
Click on the Add Job button and complete the fields as shown here:
Name: s3-glue-redshift
IAM role: Glue-IAM-Role (this is the same role you created in the previous section)
Type: Spark
Glue version: Spark 2.4, Python 3 with improved job startup times (Glue version 2.0)
Leave the other fields as they are and then click on Next.
Select sales_records_csv and click on Next.
Leave the default transform type, Change schema, selected and then click Next (at the time of writing this book, machine learning transforms are not supported for Glue 2.0).
Select Create tables in your data target. Choose JDBC as the data store and glue-redshift-connection as the connection. Provide glue-dev as the database name (as created in the previous section) and then click Next.
Next comes the Output Schema Definition page, where you can remove any columns you do not want in the target schema. Scroll down and click on Save job and edit script.
You can now see the pipeline on the left-hand side of the screen and the suggested code on the right-hand side, as shown in Figure 3.6. You can modify the code based on your requirements. Click on the Run job button; a pop-up window appears where you can optionally edit the job details. Then, click on the Run job button again to start the job:
Figure 3.6 – A screenshot of the AWS Glue ETL job
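For reference, the script that AWS Glue generates follows a standard pattern: read the source table from the Data Catalog, apply a column mapping, and write the result to the JDBC target. The following is a minimal sketch of that pattern; the catalog database name (sales-db), the column mappings, and the target table name are illustrative assumptions, since they depend on the crawler output from the previous section.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: resolve job arguments and initialize the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table registered in the Data Catalog by the crawler.
# "sales-db" is an assumed catalog database name.
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales-db",
    table_name="sales_records_csv",
    transformation_ctx="source",
)

# Map source columns to target columns. The columns shown here are
# illustrative placeholders for the ones discovered by the crawler.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("region", "string", "region", "string"),
        ("country", "string", "country", "string"),
        ("total_profit", "double", "total_profit", "double"),
    ],
    transformation_ctx="mapped",
)

# Write the mapped records to Amazon Redshift through the JDBC connection
# created earlier, staging intermediate files in the job's temporary directory.
# The target table name ("sales_records_csv") is also an assumption.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="glue-redshift-connection",
    connection_options={"dbtable": "sales_records_csv", "database": "glue-dev"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="sink",
)

job.commit()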
Once the job is successful, navigate to Amazon Redshift and click on Query editor.
Set the database name as glue-dev and then provide the username and password to create a connection.
Select the public schema, and now you can query the table to see the records, as shown in Figure 3.7:
Figure 3.7 – A screenshot of Amazon Redshift’s Query editor
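If you prefer to query the table programmatically rather than through the query editor, the Redshift Data API offers one way to do so. The snippet below is a minimal sketch; the cluster identifier, database user, and table name are assumed placeholders for the values you used when creating the cluster and running the job.

import time
import boto3

# A programmatic alternative to the Query editor, using the Redshift Data API.
client = boto3.client("redshift-data")

# ClusterIdentifier, DbUser, and the table name are placeholders -- substitute
# the values from your own cluster and job.
response = client.execute_statement(
    ClusterIdentifier="redshift-cluster-1",
    Database="glue-dev",
    DbUser="awsuser",
    Sql="SELECT * FROM public.sales_records_csv LIMIT 10;",
)

# The Data API is asynchronous: poll until the statement finishes,
# then fetch the result set.
while True:
    status = client.describe_statement(Id=response["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status == "FINISHED":
    result = client.get_statement_result(Id=response["Id"])
    for record in result["Records"]:
        print(record)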
You now understand how to create an ETL job using AWS Glue to copy data from an S3 bucket to Amazon Redshift. You also queried the data in Amazon Redshift using the query editor in the console. It is recommended to delete the Redshift cluster and the AWS Glue job once you have completed the steps successfully. AWS also creates two buckets in your account to store AWS Glue scripts and temporary results; delete these as well to save costs. You will use the Data Catalog created on the S3 data in the next section.
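The clean-up can also be scripted with boto3, as sketched below. The Redshift cluster identifier and the Glue bucket names are assumptions and vary by account and Region, so confirm the exact names in your account before running this.

import boto3

# Delete the Glue job created in this exercise.
glue = boto3.client("glue")
glue.delete_job(JobName="s3-glue-redshift")

# Delete the Redshift cluster; the identifier below is a placeholder.
redshift = boto3.client("redshift")
redshift.delete_cluster(
    ClusterIdentifier="redshift-cluster-1",
    SkipFinalClusterSnapshot=True,
)

# Empty and delete the two buckets AWS Glue created for scripts and
# temporary results (names vary by account and Region).
s3 = boto3.resource("s3")
for bucket_name in [
    "aws-glue-scripts-<account-id>-<region>",
    "aws-glue-temporary-<account-id>-<region>",
]:
    bucket = s3.Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()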
In the following section, you will learn about querying S3 data using Athena.