First of all, you can use multiple AWS services to prepare data for machine learning, such as Elastic MapReduce (EMR), Redshift, Glue, and so on. After preprocessing the training data, you should store it in S3, in a format expected by the algorithm you are using. Table 6.1 shows the list of acceptable data formats per algorithm.
| Data format | Algorithms |
| --- | --- |
| application/x-image | Object detection, semantic segmentation |
| application/x-recordio | Object detection |
| application/x-recordio-protobuf | Factorization machines, K-Means, KNN, latent Dirichlet allocation, linear learner, NTM, PCA, RCF, sequence-to-sequence |
| application/jsonlines | BlazingText, DeepAR |
| image/jpeg | Object detection, semantic segmentation |
| image/png | Object detection, semantic segmentation |
| text/csv | IP Insights, K-Means, KNN, latent Dirichlet allocation, linear learner, NTM, PCA, RCF, XGBoost |
| text/libsvm | XGBoost |

Table 6.1 – Data formats that are acceptable per AWS algorithm
As you can see, many algorithms accept the text/csv format. You should follow these rules if you want to use that format:

- Your CSV file must not have a header record, and for supervised learning, the target variable must be in the first column.
- For unsupervised learning, you must specify the absence of labels in the content type, for example, text/csv;label_size=0.
Although the text/csv format is fine for many use cases, most of the time, AWS's built-in algorithms work better with recordIO-protobuf. This is an optimized data format used to train AWS's built-in algorithms, in which SageMaker converts each observation in the dataset into a binary representation composed of 4-byte floats.
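For example, you could serialize a NumPy array into recordIO-protobuf with the SageMaker Python SDK before uploading it to S3. The following is a minimal sketch; the bucket name and S3 key are hypothetical placeholders:

```python
import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Toy training data: 100 observations, 5 features, binary labels.
features = np.random.rand(100, 5).astype("float32")
labels = np.random.randint(0, 2, size=100).astype("float32")

# Serialize the arrays into the recordIO-protobuf format expected by
# algorithms such as linear learner and factorization machines.
buffer = io.BytesIO()
smac.write_numpy_to_dense_tensor(buffer, features, labels)
buffer.seek(0)

# Upload the serialized data to S3 so SageMaker can read it during training.
# "my-bucket" is a hypothetical placeholder.
boto3.resource("s3").Object("my-bucket", "train/data.rec").upload_fileobj(buffer)
```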
RecordIO-protobuf accepts two input modes: pipe mode and file mode. In pipe mode, the data is streamed directly from S3, which means training jobs start sooner and need less disk space on the training instance. In file mode, the data is copied from S3 to the training instance's storage volume before training starts.
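Assuming you are using the SageMaker Python SDK, selecting the input mode could look like the following sketch (the S3 URI is a hypothetical placeholder):

```python
from sagemaker.inputs import TrainingInput

# Pipe mode streams records from S3 while training runs; file mode (the
# default) copies the whole dataset to the instance's storage volume first.
train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",  # switch to "File" to copy the data instead
)

# You would then pass this channel to an estimator, for example:
# estimator.fit({"train": train_input})
```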
You are almost ready! Now you can take a quick look at some modeling definitions that will help you understand some more advanced algorithms.
Before you start diving into the algorithms, there is an important modeling concept that you should be aware of: the ensemble. The term ensemble is used to describe methods that combine multiple models to make predictions.
A regular algorithm that does not implement ensemble methods relies on a single model to train and predict the target variable. That is what happens when you create a decision tree or a regression model, for instance. On the other hand, algorithms that do implement ensemble methods rely on multiple models to predict the target variable. In that case, since each of these models might come up with a different prediction for the target variable, ensemble algorithms implement either a voting system (for classification models) or an averaging system (for regression models) to output the final result. Table 6.2 illustrates a very simple voting system for an ensemble algorithm composed of three models.
| Transaction | Model A | Model B | Model C | Prediction |
| --- | --- | --- | --- | --- |
| 1 | Fraud | Fraud | Not Fraud | Fraud |
| 2 | Not Fraud | Not Fraud | Not Fraud | Not Fraud |
| 3 | Fraud | Fraud | Fraud | Fraud |
| 4 | Not Fraud | Not Fraud | Fraud | Not Fraud |

Table 6.2 – An example of a voting system on ensemble methods
As described before, the same approach works for regression problems, except that, instead of voting, the ensemble averages the results of each model and uses that average as the outcome.
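To make the voting scheme concrete, the following toy sketch reproduces the logic of Table 6.2 in plain Python (the helper functions are illustrative, not part of any library):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by most of the ensemble's members."""
    return Counter(predictions).most_common(1)[0][0]

# Transaction 1 from Table 6.2: two of the three models predict "Fraud".
print(majority_vote(["Fraud", "Fraud", "Not Fraud"]))  # Fraud

def average_vote(predictions):
    """For regression, averaging replaces voting."""
    return sum(predictions) / len(predictions)

print(average_vote([10.0, 12.0, 11.0]))  # 11.0
```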
Voting and averaging are just two examples of ensemble approaches. Other powerful techniques include blending and stacking, where you can create multiple models and use the outcome of each model as a feature for a main model. Looking back at Table 6.2, columns Model A, Model B, and Model C could be used as features to predict the final outcome.
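The following is a minimal stacking sketch using scikit-learn (an illustrative library choice; the base and final estimators are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("model_a", DecisionTreeClassifier(random_state=42)),
        ("model_b", RandomForestClassifier(random_state=42)),
    ],
    # The base models' predictions become features for a main model,
    # just like columns Model A to C in Table 6.2.
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))
```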
It turns out that many machine learning algorithms use ensemble methods while training, in an embedded way. These algorithms can be classified into two main categories:

- Bagging (bootstrap aggregating), where multiple models are trained in parallel, each on a random sample of the training data, and their predictions are combined; random forest is a well-known example.
- Boosting, where models are trained sequentially and each new model focuses on correcting the errors of its predecessors; AdaBoost and gradient boosting (the basis of XGBoost) are well-known examples.
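As a quick illustration of both categories, here is a short sketch using scikit-learn (again, an illustrative library choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Bagging: a random forest trains many trees in parallel on bootstrap samples.
bagging_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Boosting: each new tree is trained to correct the errors of the previous ones.
boosting_model = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X, y)

print(bagging_model.score(X, y), boosting_model.score(X, y))
```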
Now that you know what ensemble models are, you can look at some machine learning algorithms that are likely to appear in your exam. Keep in mind that not all of them use ensemble approaches.
The next few sections are split based on AWS algorithm categories.
Finally, you will have an overview of reinforcement learning on AWS.