Processing stored data on AWS

There are several AWS services for processing data stored in AWS. In this section, you will learn about AWS Batch and AWS Elastic MapReduce (EMR). EMR is an AWS product that primarily runs MapReduce jobs and Spark applications in a managed way, while AWS Batch is used for long-running, compute-heavy batch workloads.

AWS EMR

EMR is a managed implementation of Apache Hadoop provided as a service by AWS. It includes other components of the Hadoop ecosystem, such as Spark, HBase, Flink, Presto, Hive, and Pig. You will not need to learn about these in detail for the certification exam, but here’s some information about EMR:

  • EMR clusters can be launched from the AWS console or via the AWS CLI with a specified number of nodes, and a cluster can be long-running or ad hoc. With a traditional long-running cluster, you have to configure and manage the machines yourself, and if jobs need to run faster, you have to add nodes to the cluster manually. With EMR, these admin overheads disappear: you request any number of nodes and EMR launches and manages them for you (see the first sketch after this list). If autoscaling is enabled on the cluster, EMR adjusts the number of nodes to match demand, launching new nodes when the load is high and decommissioning them once the load drops.
  • EMR uses EC2 instances in the background and runs in a single Availability Zone within a VPC, which enables faster network speeds between the nodes. AWS Glue uses EMR clusters behind the scenes, so Glue users do not need any operational understanding of EMR.
  • From a use case standpoint, EMR is commonly used to process or transform data stored in S3 and write the output back to S3 (see the second sketch after this list). EMR uses nodes (EC2 instances) as the compute units for data processing, and these nodes come in three variants: master nodes, core nodes, and task nodes.
  • The EMR master node acts as the Hadoop NameNode and manages the cluster and its health. It is responsible for distributing the job workload among the core nodes and task nodes. If SSH is enabled, you can connect to the master node instance and access the cluster (the third sketch after this list shows one way to look up its public DNS name).
  • An EMR cluster can have one or more core nodes. In Hadoop terms, core nodes are similar to data nodes: they provide HDFS storage and are also responsible for running tasks.
  • Task nodes are optional and have no HDFS storage; they are only responsible for running tasks. If a task node fails, HDFS storage is unaffected, whereas a core node failure interrupts HDFS storage.
  • EMR also has a filesystem called EMRFS, which is backed by S3 and is therefore regionally resilient: if a core node fails, the data is still safe in S3. HDFS, on the other hand, offers better I/O performance and is faster than EMRFS.
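
As a concrete illustration of the first point, here is a minimal boto3 sketch that launches a small long-running EMR cluster with one master node and two core nodes. The region, release label, log bucket, and subnet ID are placeholder assumptions you would replace with your own values:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Launch a small cluster: one master node and two core nodes.
response = emr.run_job_flow(
    Name="mls-c01-demo-cluster",       # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",         # example EMR release
    LogUri="s3://my-emr-logs/",        # hypothetical log bucket
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,         # long-running cluster
        "Ec2SubnetId": "subnet-0123456789abcdef0",   # placeholder subnet
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])  # cluster ID, e.g. "j-..."
```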
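
The second point, transforming S3 data and writing results back to S3, is typically expressed as a step submitted to a running cluster. The sketch below submits a Spark step that reads and writes S3 via EMRFS (s3:// paths); the script path and bucket names are hypothetical, and the cluster ID would come from the launch call above:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Submit a Spark step that reads input from S3 and writes output to S3.
emr.add_job_flow_steps(
    JobFlowId="j-ABC123EXAMPLE",  # ID returned by run_job_flow
    Steps=[
        {
            "Name": "transform-s3-data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/scripts/transform.py",  # hypothetical script
                    "s3://my-bucket/input/",                # hypothetical input
                    "s3://my-bucket/output/",               # hypothetical output
                ],
            },
        }
    ],
)
```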
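
Finally, to SSH into the master node you first need its public DNS name. One way to retrieve it, sketched below with a placeholder cluster ID, is the describe_cluster call:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Look up the master node's public DNS name for an SSH connection.
cluster = emr.describe_cluster(ClusterId="j-ABC123EXAMPLE")["Cluster"]
print(cluster["Status"]["State"])      # e.g. "WAITING" when the cluster is idle
print(cluster["MasterPublicDnsName"])  # host for: ssh -i key.pem hadoop@<dns-name>
```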

In the following section, you will learn about AWS Batch, a managed batch-processing compute service suited to long-running, compute-heavy jobs.