Amazon SageMaker Modeling

In the previous chapter, you learned several model optimization and evaluation techniques. You also learned various ways of storing data, processing it, and applying different statistical approaches to it. So, how can you now build a pipeline for all of this? Well, you can read data, process it, and build machine learning (ML) models on the processed data. But what if your first ML model does not perform well? Can you fine-tune it? The answer is yes; you can do nearly everything using Amazon SageMaker. This chapter will walk you through the following topics using Amazon SageMaker:

  • Understanding different instances of Amazon SageMaker
  • Cleaning and preparing data in Jupyter Notebook in Amazon SageMaker
  • Model training in Amazon SageMaker
  • Using SageMaker’s built-in ML algorithms
  • Writing custom training and inference code in SageMaker

Technical requirements

You can download the data used in this chapter’s examples from GitHub at https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide-Second-Edition/tree/main/Chapter09.

Creating notebooks in Amazon SageMaker

If you are working with ML, then you need to perform actions such as storing data, processing data, preparing data for model training, training the model, and deploying it for inference. These stages are complex, and each of them requires a machine to perform its task. With Amazon SageMaker, carrying out these tasks becomes much easier.

What is Amazon SageMaker?

SageMaker provides training instances to train a model on your data and endpoint instances to serve inferences from that model. It also provides notebook instances running Jupyter Notebook to clean and understand the data. Once you are happy with your cleaning process, you should store the cleaned data in S3 as part of the staging for training. You can then launch training instances to consume this training data and produce an ML model. The ML model can be stored in S3, and endpoint instances can consume the model to produce results for end users.

If you draw this in a block diagram, then it will look similar to Figure 9.1:

Figure 9.1 – A pictorial representation of the different layers of the Amazon SageMaker instances
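To make this flow concrete, here is a minimal sketch using the SageMaker Python SDK, assuming the cleaned training data has already been staged in S3. The bucket name, IAM role ARN, and hyperparameters are hypothetical placeholders, and the built-in XGBoost algorithm stands in for whichever algorithm you choose:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Resolve the built-in XGBoost container image for the current region
container = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.7-1"
)

# Training instances consume the staged data and write the model back to S3
estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Cleaned data previously staged in S3 by the notebook instance
train_input = sagemaker.inputs.TrainingInput(
    "s3://my-bucket/train/", content_type="text/csv"
)
estimator.fit({"train": train_input})

# An endpoint instance now serves the trained model to end users
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```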

Now, you will take a look at the Amazon SageMaker console and get a better feel for it. Once you log in to your AWS account and go to Amazon SageMaker, you will see something similar to Figure 9.2:

Figure 9.2 – A quick look at the SageMaker console

There are three different sections in the menu on the left, labeled Notebook, Training, and Inference, that have been expanded in Figure 9.2 so that you can dive in and understand them better.

Notebook has three different options that you can use:

  • Notebook instances: This option helps you create, open, start, and stop notebook instances. These instances are responsible for running Jupyter Notebooks and allow you to choose the instance type based on the workload of your use case. The best practice is to use a notebook instance to orchestrate the data pipeline for processing a large dataset, for example, by calling AWS Glue for ETL services or Amazon EMR to run Spark applications from the notebook (see the first sketch after this list). If you had to create a secure notebook instance outside AWS, you would need to take care of endpoint security, network security, launching the machine, managing its storage, and managing the Jupyter Notebook application running on it. With SageMaker, you do not need to manage any of this.
  • Lifecycle configurations: These are useful when a use case requires a library that is not available in the notebook instances. To install the library, you would normally run pip install or conda install; however, as soon as the notebook instance is terminated, the customization is lost. To avoid this, you can customize your notebook instance with a script supplied via a lifecycle configuration (see the second sketch after this list). You can choose any of the environments present in /home/ec2-user/anaconda3/envs/ and customize that specific environment as required.
  • Git repositories: AWS CodeCommit, GitHub, or any other Git server can be associated with a notebook instance for the persistence of your notebooks. If access is granted, the same notebooks can be used by other developers to collaborate and save code under source control. Git repositories can either be added separately using this option or associated with a notebook instance during creation, as the second sketch after this list also shows.
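
As mentioned in the Notebook instances point, a notebook instance is best used as an orchestrator. The following is a minimal sketch of kicking off an AWS Glue ETL job from a notebook with boto3; the job name is a hypothetical placeholder for a job already defined in Glue:

```python
import boto3

glue = boto3.client("glue")

# Start a hypothetical Glue ETL job that cleans the raw dataset
run = glue.start_job_run(JobName="clean-clickstream-data")

# Check the status of the run; in practice you would poll until it completes
status = glue.get_job_run(
    JobName="clean-clickstream-data", RunId=run["JobRunId"]
)
print(status["JobRun"]["JobRunState"])
```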
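The next sketch ties the last two points together using boto3: it registers a lifecycle configuration that reinstalls a library on every start, then creates a notebook instance that uses it and is associated with a Git repository. The script contents, instance name, role ARN, and repository URL are all hypothetical:

```python
import base64
import boto3

sm = boto3.client("sagemaker")

# Shell script run on every start of the instance; it activates one of the
# conda environments under /home/ec2-user/anaconda3/envs/ and installs a
# library that would otherwise be lost when the instance is terminated
on_start = """#!/bin/bash
set -e
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade lightgbm
source /home/ec2-user/anaconda3/bin/deactivate
"""

# Lifecycle configuration scripts are passed as base64-encoded strings
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-lightgbm",
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)

# Create the notebook instance with the lifecycle configuration attached
# and a default Git repository associated at creation time
sm.create_notebook_instance(
    NotebookInstanceName="mls-c01-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical
    LifecycleConfigName="install-lightgbm",
    DefaultCodeRepository="https://github.com/example/my-notebooks.git",  # hypothetical
)
```

Note that lifecycle configuration scripts must complete within a few minutes, so heavyweight installations are better moved into a persistent conda environment or a custom setup step.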