Technical requirements
You can download the data used in the examples from GitHub, available here: https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide-Second-Edition/tree/main/Chapter03.
Creating ETL jobs on AWS Glue
A modern data pipeline has multiple stages, such as generating data, collecting it, storing it, performing ETL, analyzing it, and visualizing it. In this section, you will review each of these stages at a high level and look at the extract, transform, load (ETL) process in depth:
- Data can be generated from many sources, including mobile and IoT devices, web logs, social media, transactional systems, and online games.
- This huge amount of generated data can be collected by polling services, through Amazon API Gateway integrated with AWS Lambda, or via streams such as Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (MSK), or Amazon Kinesis Data Firehose. If you have an on-premises database and want to bring that data to AWS, you would choose AWS Database Migration Service (DMS). You can sync your on-premises data to Amazon S3, Amazon EFS, or Amazon FSx via AWS DataSync, and AWS Snowball is used to transfer large volumes of data into and out of AWS.
- The next step involves storing data. You learned about some of the services to do this in the previous chapter, such as S3, EBS, EFS, RDS, Redshift, and DynamoDB.
- Once you know your data storage requirements, an ETL job can be designed to extract-transform-load or extract-load-transform your structured or unstructured data into the format you need for further analysis. For example, you can use AWS Lambda to transform the data on the fly and store the result in S3 (see the sketch after this list), or you can run a Spark application on an EMR cluster to transform the data and store it in S3, Redshift, or RDS.
- AWS offers many services for analyzing the transformed data, for example, EMR, Athena, Redshift, Redshift Spectrum, and Kinesis Data Analytics (an example Athena query follows this list).
- Once the data is analyzed, you can visualize it using Amazon QuickSight to understand patterns or trends. Data scientists and machine learning practitioners apply statistical analysis to better understand the data distribution, while business users rely on it to prepare reports. You will explore various ways to present and visualize data in Chapter 5, Data Understanding and Visualization.
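To make the transformation step more concrete, here is a minimal sketch of a Lambda handler that reads a newly uploaded JSON object from S3, keeps only the fields needed downstream, and writes the result back to a curated prefix. The bucket layout, the field names (user_id, amount), and the curated/ prefix are illustrative assumptions rather than part of any dataset used in this book:

```python
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 "ObjectCreated" event notification
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw JSON object that was just uploaded
        obj = s3.get_object(Bucket=bucket, Key=key)
        rows = json.loads(obj["Body"].read())

        # Example transformation: keep only the fields needed downstream
        # (user_id and amount are hypothetical field names)
        transformed = [
            {"user_id": r["user_id"], "amount": float(r["amount"])}
            for r in rows
        ]

        # Write the transformed data to a "curated" prefix in the same bucket
        s3.put_object(
            Bucket=bucket,
            Key=f"curated/{key}",
            Body=json.dumps(transformed).encode("utf-8"),
        )
    return {"statusCode": 200}
```

In practice, you would configure an S3 event notification (or an EventBridge rule) to invoke this function whenever a new object lands in the raw prefix.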
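Similarly, once the curated data is in S3, the following is a hedged sketch of how you might query it with Athena through boto3; the database, table, and results bucket names are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Submit an ad hoc SQL query over data registered in the Glue Data Catalog.
# "curated_db", "sales", and the results bucket are placeholder names.
response = athena.start_query_execution(
    QueryString="SELECT user_id, SUM(amount) AS total FROM sales GROUP BY user_id",
    QueryExecutionContext={"Database": "curated_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# The query runs asynchronously; poll get_query_execution with this ID
print(response["QueryExecutionId"])
```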
As this traditional pipeline shows, ETL is largely about writing and maintaining code on servers so that everything runs smoothly. If the data format changes, the code must change, which in turn changes the target schema; if the data source changes, the code must handle that too, and all of this is overhead. Should you have to write code to detect these changes in your data sources? Wouldn't it be better to have a system that adapts to the change and discovers the data for you? The answer is yes, and AWS Glue provides exactly that. Next, you will learn why AWS Glue is so popular.
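As a brief illustration of that discovery capability, the following sketch uses boto3 to create and run a Glue crawler over an S3 prefix so that the schema is inferred and registered in the Glue Data Catalog. The crawler name, IAM role ARN, database name, and S3 path are all placeholder assumptions:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that discovers the schema of data landing in S3
# (the name, role ARN, database, and path below are placeholders)
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="curated_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/curated/"}]},
)

# Run the crawler; discovered tables appear in the Glue Data Catalog
glue.start_crawler(Name="sales-data-crawler")
```

Once the crawler has run, the discovered tables can be queried from Athena or used as sources in Glue ETL jobs without you having to define the schema by hand.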