Important note 2 – Amazon SageMaker Modeling – MLS-C01 Study Guide

Important note

Please note that this configuration file uses two other variables, bucket and prefix, which should be replaced with your bucket name and your prefix key (if needed), respectively. It also refers to s3_input_train and s3_input_validation, two variables that point to the training and validation datasets in S3.
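As a point of reference, these variables might be defined as follows. This is only a sketch: the bucket name and prefix shown here are placeholders that you would replace with your own values.

```python
# Illustrative placeholders -- replace with your own bucket name and prefix key.
bucket = "my-example-bucket"
prefix = "xgboost-tuning"

# Pointers to the train and validation datasets in S3
s3_input_train = f"s3://{bucket}/{prefix}/train"
s3_input_validation = f"s3://{bucket}/{prefix}/validation"
```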

Once you have set your configurations, you can spin up the tuning process:

smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="my-tuning-example",
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition
)

Next, let’s find out how to track the execution of this process.

Tracking your training jobs and selecting the best model

Once you have started the tuning process, there are two additional steps you will likely want to perform: tracking the progress of the tuning job and selecting the winning model (that is, the one with the best set of hyperparameters).

In order to find your training jobs, you should go to the SageMaker console and navigate to Hyperparameter tuning jobs. You will then find a list of executed tuning jobs, including yours:

Figure 9.11 – Finding your tuning job
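You can also track the tuning job programmatically. The following is a minimal sketch using boto3's describe_hyper_parameter_tuning_job call; the job name matches the earlier example, and the helper that condenses the response is our own illustration, not part of the SageMaker API.

```python
def summarize_tuning_job(desc):
    """Condense a DescribeHyperParameterTuningJob response into a short status summary."""
    counters = desc.get("TrainingJobStatusCounters", {})
    return {
        "status": desc["HyperParameterTuningJobStatus"],
        "completed": counters.get("Completed", 0),
        "in_progress": counters.get("InProgress", 0),
        "failed": counters.get("RetryableError", 0) + counters.get("NonRetryableError", 0),
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials to be configured

    smclient = boto3.client("sagemaker")
    desc = smclient.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName="my-tuning-example"
    )
    print(summarize_tuning_job(desc))
```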

If you access your tuning job by clicking on its name, you will find a summary page that includes the most relevant information about the tuning process. On the Training jobs tab, you will see all the training jobs that have been executed:

Figure 9.12 – Summary of the training jobs in the tuning process

Finally, if you click on the Best training job tab, you will find the best set of hyperparameters for your model, including a handy button for creating a new model based on the best hyperparameters that were just found:

Figure 9.13 – Finding the best set of hyperparameters
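The same information is available programmatically: the describe_hyper_parameter_tuning_job response contains a BestTrainingJob element with the tuned hyperparameter values. The sketch below assumes the tuning job name from the earlier example; the helper function is our own illustration.

```python
def best_hyperparameters(desc):
    """Return the winning training job's name and its tuned hyperparameter values."""
    best = desc["BestTrainingJob"]
    return best["TrainingJobName"], best["TunedHyperParameters"]

if __name__ == "__main__":
    import boto3  # requires AWS credentials to be configured

    smclient = boto3.client("sagemaker")
    desc = smclient.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName="my-tuning-example"
    )
    name, params = best_hyperparameters(desc)
    print(name, params)
```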

As you can see, SageMaker is very intuitive, and once you know the main concepts behind model optimization, working with SageMaker should be easier. You now understand how to use SageMaker for your specific needs. In the next section, you will explore how to select instance types for various use cases and how to secure your notebooks.

Choosing instance types in Amazon SageMaker

SageMaker uses a pay-for-usage pricing model; there is no minimum fee.

When you think about instances in SageMaker, it all starts with an EC2 instance, which is responsible for all your processing. It is a managed EC2 instance: it will not show up in the EC2 console, and it cannot be accessed via SSH. The names of SageMaker instance types start with ml.

SageMaker offers instances of the following families:

  • The t family: This is the burstable CPU family, offering a balanced ratio of CPU to memory. With a long-running training job, performance degrades over time as CPU credits are consumed; for very small jobs, however, these instances are cost-effective. For example, if you want a notebook instance just to launch training jobs, this family is the most appropriate and cost-effective choice.
  • The m family: Unlike the burstable t family, which consumes CPU credits, this family delivers constant throughput, making it the right choice for long-running ML jobs. It has a CPU-to-memory ratio similar to that of the t family.
  • The r family: This is the memory-optimized family. When do you need it? Imagine a use case where you have to load a dataset into memory and perform data engineering on it. Such a job is memory-bound, so it benefits from the extra memory these instances provide.
  • The c family: c-family instances are compute-optimized, which suits jobs that need high compute power and less memory for storing data. As the following table shows, c5.2xlarge has 8 vCPUs and 16 GiB of memory, giving it a high compute-to-memory ratio. For example, if you need to test a compute-intensive algorithm on a sample of records from a huge DataFrame, this instance family is the go-to option.
  • The p family: This is a GPU family that supports accelerated computing jobs such as training and inference. Notably, p-family instances are ideal for large, distributed training jobs, reducing training time and thus making them much more cost-effective. The p3/p3dn GPU compute instances can deliver up to 1 petaFLOPS of compute with up to 256 GB of GPU memory and 100 Gbps of networking throughput using 8 NVIDIA V100 GPUs. They are highly optimized for training but are typically not fully utilized for inference.
  • The g family: For cost-effective, small-scale training jobs, g-family GPU instances are ideal. G4 has the lowest cost per inference among GPU instances. It uses NVIDIA T4 GPUs: the G4 GPU compute instance delivers up to 520 TFLOPS of compute with 8 NVIDIA T4 GPUs. This instance family is best suited to simple networks.
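The family guidelines above can be sketched as a simple lookup. This is our own illustration: the workload labels are hypothetical, and the specific instance sizes are just examples of each family.

```python
# Toy mapping from a workload profile to a SageMaker instance type, following
# the family guidelines above. Workload labels and sizes are illustrative only.
INSTANCE_BY_WORKLOAD = {
    "notebook": "ml.t3.medium",                   # burstable, cheap for launching jobs
    "long_cpu_training": "ml.m5.2xlarge",         # constant CPU throughput
    "in_memory_etl": "ml.r5.2xlarge",             # memory-optimized
    "compute_heavy": "ml.c5.2xlarge",             # compute-optimized
    "distributed_gpu_training": "ml.p3.2xlarge",  # accelerated training
    "gpu_inference": "ml.g4dn.2xlarge",           # lowest cost per GPU inference
}

def pick_instance_type(workload: str) -> str:
    """Look up a suitable SageMaker instance type for a workload profile."""
    return INSTANCE_BY_WORKLOAD[workload]
```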

In the following table, you have a visual comparison of the CPU and memory ratios of the 2xlarge instance types from each family:

t3.2xlarge: 8 vCPU, 32 GiB
m5.2xlarge: 8 vCPU, 32 GiB
r5.2xlarge: 8 vCPU, 64 GiB
c5.2xlarge: 8 vCPU, 16 GiB
p3.2xlarge: 8 vCPU, 61 GiB
g4dn.2xlarge: 8 vCPU, 32 GiB

Table 9.1 – A table showing the CPU and memory ratio of different instance types