Storing the training data – Applying Machine Learning Algorithms – MLS-C01 Study Guide

Storing the training data

First of all, you can use multiple AWS services to prepare data for machine learning, such as Elastic MapReduce (EMR), Redshift, and Glue. After preprocessing the training data, you should store it in S3, in a format expected by the algorithm you are using. Table 6.1 shows the acceptable data formats per algorithm.

Data format | Algorithm
application/x-image | Object detection algorithm, semantic segmentation
application/x-recordio | Object detection algorithm
application/x-recordio-protobuf | Factorization machines, K-Means, KNN, latent Dirichlet allocation, linear learner, NTM, PCA, RCF, sequence-to-sequence
application/jsonlines | BlazingText, DeepAR
image/jpeg | Object detection algorithm, semantic segmentation
image/png | Object detection algorithm, semantic segmentation
text/csv | IP Insights, K-Means, KNN, latent Dirichlet allocation, linear learner, NTM, PCA, RCF, XGBoost
text/libsvm | XGBoost

Table 6.1 – Data formats that are acceptable per AWS algorithm

As you can see, many algorithms accept the text/csv format. You should follow these rules if you want to use it:

  • Your CSV file cannot have a header record.
  • For supervised learning, the target variable must be in the first column.
  • While configuring the training pipeline, set the input data channel's content_type to text/csv.
  • For unsupervised learning, set label_size within content_type, as follows: ‘content_type=text/csv;label_size=0’.
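The first two rules can be illustrated with a short sketch that writes training data in the expected layout: no header record, target variable in the first column. The rows below are made up for illustration; only Python's standard csv module is used.

```python
import csv
import io

# Hypothetical rows: (target, feature_1, feature_2)
rows = [
    (1, 0.5, 3.2),
    (0, 1.7, 0.9),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for target, f1, f2 in rows:
    # Target first, and no header record is ever written
    writer.writerow([target, f1, f2])

print(buffer.getvalue())
# For unsupervised learning, the channel's content type would be the
# string below, telling SageMaker there is no label column:
content_type = "text/csv;label_size=0"
```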

Although the text/csv format is fine for many use cases, most of the time, AWS’s built-in algorithms work better with recordIO-protobuf. This is an optimized data format used to train AWS’s built-in algorithms, in which SageMaker converts each observation in the dataset into a binary representation made up of 4-byte floats.
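To see what "a set of 4-byte floats" means, you can pack values with Python's struct module. This only illustrates the float32 encoding size, not SageMaker's actual protobuf schema:

```python
import struct

# Pack a single value as a little-endian 4-byte (32-bit) float
encoded = struct.pack("<f", 3.14)
print(len(encoded))  # a float32 occupies exactly 4 bytes

# An observation with three features becomes a sequence of such floats
observation = [1.0, 2.5, 0.0]
encoded_obs = struct.pack(f"<{len(observation)}f", *observation)
print(len(encoded_obs))  # 3 floats x 4 bytes = 12 bytes
```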

RecordIO-protobuf accepts two types of input mode: pipe mode and file mode. In pipe mode, the data is streamed directly from S3, which helps optimize storage and reduces training start-up time. In file mode, the data is copied from S3 to the training instance’s storage volume.

You are almost ready! Now you can take a quick look at some modeling definitions that will help you understand some more advanced algorithms.

A word about ensemble models

Before you start diving into the algorithms, there is an important modeling concept that you should be aware of – ensemble. The term ensemble describes methods that combine multiple models to create a single predictor.

A regular algorithm that does not implement ensemble methods will rely on a single model to train and predict the target variable. That is what happens when you create a decision tree or regression model. On the other hand, algorithms that do implement ensemble methods will rely on multiple models to predict the target variable. In that case, since each of these models might come up with a different prediction for the target variable, ensemble algorithms implement either a voting (for classification models) or averaging (for regression models) system to output the final results. Table 6.2 illustrates a very simple voting system for an ensemble algorithm composed of three models.

Transaction | Model A | Model B | Model C | Prediction
1 | Fraud | Fraud | Not Fraud | Fraud
2 | Not Fraud | Not Fraud | Not Fraud | Not Fraud
3 | Fraud | Fraud | Fraud | Fraud
4 | Not Fraud | Not Fraud | Fraud | Not Fraud

Table 6.2 – An example of a voting system on ensemble methods

As described before, the same approach works for regression problems, except that instead of voting, the ensemble averages the results of each model and uses that as the outcome.
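Both combination schemes can be sketched in a few lines of plain Python. The model outputs below are made up; transaction 1 is taken from Table 6.2:

```python
from collections import Counter
from statistics import mean

def vote(predictions):
    """Majority vote, as used by classification ensembles."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Averaging, as used by regression ensembles."""
    return mean(predictions)

# Transaction 1: two models predict Fraud, one predicts Not Fraud
print(vote(["Fraud", "Fraud", "Not Fraud"]))  # Fraud

# A regression ensemble would average the three model outputs instead
print(average([10.0, 12.0, 14.0]))  # 12.0
```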

Voting and averaging are just two examples of ensemble approaches. Other powerful techniques include blending and stacking, where you can create multiple models and use the outcome of each model as a feature for a main model. Looking back at Table 6.2, columns Model A, Model B, and Model C could be used as features to predict the final outcome.
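Stacking can be sketched the same way: the base models' outputs become the feature vector fed to a main (meta) model. Everything below is hypothetical – in practice the meta-model would itself be trained (for example, a logistic regression) rather than hand-weighted:

```python
# Base model outputs per transaction, as in Table 6.2,
# encoded as 1 = Fraud, 0 = Not Fraud
base_predictions = [
    [1, 1, 0],  # transaction 1: Model A, Model B, Model C
    [0, 0, 0],  # transaction 2
    [1, 1, 1],  # transaction 3
    [0, 0, 1],  # transaction 4
]

# Hypothetical meta-model: a weighted sum of the base predictions
# followed by a decision threshold
weights = [0.5, 0.3, 0.2]

def meta_predict(features):
    score = sum(w * f for w, f in zip(weights, features))
    return "Fraud" if score >= 0.5 else "Not Fraud"

for features in base_predictions:
    print(meta_predict(features))
```

With these particular weights, the meta-model happens to reproduce the majority-vote column of Table 6.2, but a trained meta-model could learn to trust some base models more than others.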

It turns out that many machine learning algorithms use ensemble methods while training, in an embedded way. These algorithms can be classified into two main categories:

  • Bootstrapping aggregation, or bagging: With this approach, several models are trained on different bootstrap samples of the data (random samples drawn with replacement). Predictions are then combined through the voting or averaging system. The most popular algorithm in this category is Random Forest.
  • Boosting: With this approach, models are trained sequentially, with each model trying to correct the errors of the previous one by giving more weight to the observations that were predicted incorrectly. The most popular algorithms in this category are stochastic gradient boosting and AdaBoost.
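The bagging mechanics can be sketched in a few lines: draw bootstrap samples, fit one model per sample, and average the predictions. The "model" below is deliberately trivial (it just predicts the mean of its bootstrap sample) so the sketch stays self-contained:

```python
import random
from statistics import mean

def bagging_predict(data, n_models=5, seed=42):
    rng = random.Random(seed)
    model_outputs = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement
        sample = rng.choices(data, k=len(data))
        # Trivial "model": predicts the mean of its bootstrap sample
        model_outputs.append(mean(sample))
    # The averaging system combines the models' predictions
    return mean(model_outputs)

data = [3.0, 5.0, 7.0, 9.0]
print(bagging_predict(data))
```

A real bagging implementation such as Random Forest fits a decision tree to each bootstrap sample, but the sample-then-combine structure is the same.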

Now that you know what ensemble models are, you can look at some machine learning algorithms that are likely to be present in your exam. Not all of them use ensemble approaches.

The next few sections are split based on AWS algorithm categories, as follows:

  • Supervised learning
  • Unsupervised learning
  • Textual analysis
  • Image processing

Finally, you will have an overview of reinforcement learning on AWS.