Applying Machine Learning Algorithms – MLS-C01 Study Guide

In the previous chapter, you learned about data understanding and visualization. It is now time to move on to the modeling phase and study machine learning algorithms! In the earlier chapters, you learned that building machine learning models requires broad knowledge of AWS services, data engineering, data exploration, data architecture, and much more. This time, you will delve deeper into the algorithms that have already been introduced, as well as several new ones.

Having a good sense of the different types of algorithms and machine learning approaches will put you in a very good position to make decisions during your projects. Of course, this type of knowledge is also crucial to the AWS Certified Machine Learning Specialty exam.

Bear in mind that there are thousands of algorithms out there; you can even propose your own algorithm for a particular problem. In this chapter, you will learn about the most relevant ones, which are also the ones you are most likely to face in the exam.

The main topics of this chapter are as follows:

  • Storing the training data
  • A word about ensemble models
  • Supervised learning:
      • Regression models
      • Classification models
      • Forecasting models
      • Object2Vec
  • Unsupervised learning:
      • Clustering
      • Anomaly detection
      • Dimensionality reduction
      • IP Insights
  • Textual analysis (natural language processing)
  • Image processing
  • Reinforcement learning

Alright, grab a coffee and rock it!

Introducing this chapter

During this chapter, you will read about several algorithms, modeling concepts, and learning strategies. All these topics will serve you well both in the exam and throughout your career as a data scientist.

This chapter has been structured in such a way that it not only covers the necessary topics of the exam but also gives you a good sense of the most important learning strategies out there. For example, the exam will check your knowledge regarding the basic concepts of K-Means. However, this chapter will cover it on a much deeper level, since this is an important topic for your career as a data scientist.

The chapter follows this approach of looking more deeply into the algorithms' logic for the types of models that every data scientist should master. Furthermore, keep this in mind: sometimes you may go deeper than what is expected of you in the exam, but that knowledge will be extremely important in your career.

Many times during this chapter, you will see the term built-in algorithms. This term refers to the list of algorithms implemented by AWS in the SageMaker SDK.

Here is a concrete example: you can use scikit-learn's K-nearest neighbors algorithm, or KNN for short, to create a classification model and deploy it to SageMaker (if you don't remember what scikit-learn is, refresh your memory by going back to Chapter 1, Machine Learning Fundamentals). However, AWS also offers its own implementation of the KNN algorithm in its SDK, which is optimized to run in the AWS environment. Here, KNN is an example of a built-in algorithm.
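To make the first half of that example concrete, here is a minimal scikit-learn sketch of a KNN classifier. The toy dataset and the choice of k = 5 are illustrative assumptions, not requirements of the exam or of SageMaker:

```python
# Minimal KNN classification sketch with scikit-learn.
# The iris dataset and n_neighbors=5 are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # classify by the 5 nearest neighbors
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)  # fraction of correct predictions
```

A model trained this way can later be packaged and deployed to SageMaker as your own algorithm, in contrast to the built-in alternative shown next.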

The possibilities on AWS are endless because you can either take advantage of built-in algorithms or bring in your own algorithm to create models on SageMaker. Finally, just to make this very clear, here is an example of how to import a built-in algorithm from the AWS SDK:

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

# Retrieve the AWS-provided container image for the built-in KNN
# algorithm and instantiate an estimator. output_path and hyperparams
# are assumed to be defined elsewhere in your code.
knn = sagemaker.estimator.Estimator(
    get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type="ml.m5.2xlarge",
    output_path=output_path,
    sagemaker_session=sagemaker.Session(),
)

knn.set_hyperparameters(**hyperparams)

You will learn how to create models on SageMaker in Chapter 9, Amazon SageMaker Modeling. For now, just understand that AWS has its own set of libraries where those built-in algorithms are implemented.

To train and evaluate a model, you need training and testing data. After instantiating your estimator, you should then feed it with those datasets. Not to spoil Chapter 9, Amazon SageMaker Modeling, but you should know about the concept of data channels in advance.

Data channels are configurations related to input data that you pass to SageMaker when you create a training job. You set these configurations to tell SageMaker how your input data is formatted.

In Chapter 9, Amazon SageMaker Modeling, you will learn how to create training jobs and how to set data channels. As of now, you should know that while configuring data channels, you can set a content type (ContentType) and an input mode (TrainingInputMode). You will now take a closer look at how and where the training data should be stored so that it can be integrated properly with AWS’s built-in algorithms.
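As a preview of what Chapter 9 covers, data channels might be configured along these lines with the SageMaker Python SDK's TrainingInput class. The S3 paths and channel names below are placeholders, and this sketch assumes the newer (v2) SDK interface:

```python
# Hedged sketch: defining data channels for a SageMaker training job.
# "s3://my-bucket/knn/..." are placeholder S3 prefixes, not real buckets.
from sagemaker.inputs import TrainingInput

train_channel = TrainingInput(
    "s3://my-bucket/knn/train",  # placeholder S3 prefix for training data
    content_type="text/csv",     # maps to the ContentType configuration
    input_mode="File",           # maps to TrainingInputMode ("File" or "Pipe")
)
test_channel = TrainingInput(
    "s3://my-bucket/knn/test",   # placeholder S3 prefix for test data
    content_type="text/csv",
    input_mode="File",
)

# The channels are passed to the estimator as a dictionary, for example:
# knn.fit({"train": train_channel, "test": test_channel})
```

Here, content_type tells SageMaker how each record is serialized, while input_mode controls whether data is downloaded up front (File) or streamed to the container (Pipe).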