
Working with classification models

You have been learning what classification models are throughout this book. Now, you are going to look at some algorithms that are suitable for classification problems. Keep in mind that there are hundreds of classification algorithms out there, but since you are preparing for the AWS Certified Machine Learning Specialty exam, the focus here will be on the ones pre-built by AWS.

You will start with factorization machines. The factorization machines algorithm is considered an extension of the linear learner algorithm, optimized to capture interactions between features within high-dimensional sparse datasets.
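To give you a feel for how this looks in practice, here is a minimal sketch of launching a factorization machines training job with the SageMaker Python SDK. The role ARN, instance type, feature dimension, and S3 path are hypothetical placeholders, and the training data is assumed to already be staged in S3:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

# Resolve the AWS-managed container image for factorization machines
image = image_uris.retrieve("factorization-machines", session.boto_region_name)

fm = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    sagemaker_session=session,
)

# predictor_type switches the algorithm between its two modes:
# "binary_classifier" or "regressor"
fm.set_hyperparameters(
    feature_dim=10000,   # width of the (sparse) input feature vector
    num_factors=64,      # dimensionality of the factorization
    predictor_type="binary_classifier",
)

# Hypothetical S3 location holding recordIO-protobuf training data
train_input = TrainingInput(
    "s3://my-bucket/fm/train/",
    content_type="application/x-recordio-protobuf",
)
fm.fit({"train": train_input})
```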

Important note

A classic use case for factorization machines is recommendation systems, where the data is usually highly sparse. During the exam, if you are faced with a general-purpose problem (either a regression or binary classification task) whose underlying dataset is sparse, then factorization machines is probably the best answer from an algorithm perspective.

When you use factorization machines in regression mode, the root mean squared error (RMSE) will be used to evaluate the model. In binary classification mode, the algorithm uses log loss, accuracy, and F1 score to evaluate results. A deeper discussion of evaluation metrics will be provided in Chapter 7, Evaluating and Optimizing Models.
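To make these metrics concrete, the following sketch computes each of them with scikit-learn. The prediction values here are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, log_loss, accuracy_score, f1_score

# Regression mode: RMSE (square root of the mean squared error)
y_true_reg = np.array([3.1, 4.8, 2.2])
y_pred_reg = np.array([2.9, 5.0, 2.5])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Binary classification mode: log loss on predicted probabilities,
# accuracy and F1 on thresholded class labels
y_true_cls = np.array([1, 0, 1, 1])
y_prob = np.array([0.85, 0.20, 0.60, 0.35])
y_pred_cls = (y_prob >= 0.5).astype(int)

print(f"RMSE:     {rmse:.3f}")
print(f"Log loss: {log_loss(y_true_cls, y_prob):.3f}")
print(f"Accuracy: {accuracy_score(y_true_cls, y_pred_cls):.2f}")
print(f"F1:       {f1_score(y_true_cls, y_pred_cls):.2f}")
```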

You should be aware that factorization machines only accepts input data in recordIO-protobuf format. This is due to the sparse nature of the data, which recordIO-protobuf handles far more efficiently than the text/CSV format.
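The SageMaker Python SDK provides a helper for producing this format from a SciPy sparse matrix. In the sketch below, the matrix shape, density, and S3 destination are arbitrary example values:

```python
import io
import numpy as np
import scipy.sparse as sp
import sagemaker.amazon.common as smac

# A small sparse feature matrix (CSR) and its binary labels
X = sp.random(100, 10000, density=0.001, format="csr", dtype="float32")
y = np.random.randint(0, 2, size=100).astype("float32")

# Serialize to recordIO-protobuf in memory
buf = io.BytesIO()
smac.write_spmatrix_to_sparse_tensor(buf, X, y)
buf.seek(0)

# buf can now be uploaded to S3 as the training channel, for example with
# boto3.client("s3").upload_fileobj(buf, "my-bucket", "fm/train/data")
```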

The next built-in algorithm suitable for classification problems is known as K-nearest neighbors, or KNN for short. As the name suggests, this algorithm will try to find the K closest points to the input data and return either of the following predictions:

  • The most repeated class of the K closest points, if it is a classification task
  • The average value of the label of the K closest points, if it is a regression task

KNN is an index-based algorithm: it computes distances between points, assigns indexes to these points, and then stores the sorted distances along with their indexes. With that data structure, KNN can easily select the top K closest points to make the final prediction. Note that K is a hyperparameter of KNN and should be optimized during the modeling process.
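The core prediction logic can be sketched in a few lines of plain NumPy. This is a simplified, brute-force illustration of the idea, not SageMaker's actual (optimized) index implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k, task="classification"):
    # Compute the Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # "Index" step: sort the distances and keep the indexes of the K closest points
    top_k_idx = np.argsort(distances)[:k]
    neighbors = y_train[top_k_idx]
    if task == "classification":
        # Most repeated class among the K closest points
        values, counts = np.unique(neighbors, return_counts=True)
        return values[np.argmax(counts)]
    # Regression: average label of the K closest points
    return neighbors.mean()

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [7.5, 8.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```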

The other AWS built-in algorithm suitable for general-purpose problems, including classification, is known as eXtreme Gradient Boosting, or XGBoost for short. This is an ensemble, decision tree-based model.

XGBoost uses a set of weaker models (decision trees) to predict the target variable, whether the task is regression, binary classification, or multi-class classification. It is a very popular algorithm and is frequently used by top performers in machine learning competitions.

XGBoost uses a boosting learning strategy, in which each new model tries to correct the errors of the prior one. It carries the name “gradient” because it uses gradient descent to minimize the loss when adding new trees.
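The main boosting hyperparameters map directly onto this idea. The following sketch uses the open-source xgboost package for brevity (the SageMaker built-in container exposes equivalent hyperparameters such as num_round, eta, and max_depth); the toy dataset is fabricated purely for illustration:

```python
import numpy as np
from xgboost import XGBClassifier

# Toy binary classification data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = XGBClassifier(
    n_estimators=100,    # number of boosting rounds (trees added sequentially)
    max_depth=3,         # each weak learner is a shallow decision tree
    learning_rate=0.1,   # shrinks each new tree's correction of the prior error
    objective="binary:logistic",
)
model.fit(X, y)
print(model.predict(X[:5]))
```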