
Conclusion

That was a great accomplishment: you have just mastered the basics of clustering algorithms and should now be able to drive your own projects and research on this topic! For the exam, remember that clustering belongs to the unsupervised field of machine learning, so no labeled data is needed.

Also, make sure that you know how the most popular algorithm in this field works – that is, K-Means. Although clustering algorithms do not provide the meaning of each group, they are very powerful for finding patterns in the data, either to model a particular problem or simply to explore the data.
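As a quick refresher on that point, here is a minimal K-Means sketch. It assumes scikit-learn as the library, and the 2D points and choice of two clusters are purely illustrative, not part of the exam material:

```python
# Minimal K-Means sketch (assumes scikit-learn; data and k are illustrative)
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2D points, just for demonstration
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [8.0, 8.1], [7.9, 8.3], [8.2, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)            # cluster assignment for each point (no labels were given)
print(kmeans.cluster_centers_)   # learned centroids
```

Note that the algorithm groups the points without ever seeing labels; interpreting what each cluster means is still up to you.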

Coming up next, you will keep studying unsupervised algorithms and see how AWS has built one of the most powerful algorithms out there for anomaly detection, known as Random Cut Forest (RCF).

Anomaly detection

Finding anomalies in data is a very common task in modeling and data exploratory analysis. Sometimes, you might want to find anomalies in the data just to remove them before fitting a regression model, while other times, you might want to create a model that identifies anomalies as an end goal – for example, in fraud detection systems.

Again, you can use many different methods to find anomalies in the data. With some creativity, the possibilities are endless. However, there is a particular algorithm that addresses this problem that you should definitely be aware of for your exam: RCF.

RCF is an unsupervised, decision tree-based algorithm that builds multiple decision trees (a forest) from random subsamples of the training data. Technically, it randomizes the data, draws as many subsamples as there are trees, and then assigns one subsample to each tree.

This set of trees is used to assign an anomaly score to each data point. To calculate the anomaly score for a particular data point, the point is passed down each tree in the forest. As the point moves through a tree, the path length from the root node to the leaf node where it lands is recorded for that specific tree. The anomaly score for that data point is then derived from the distribution of these path lengths across all the trees in the forest.

If a data point is isolated quickly in most trees (i.e., its leaf is close to the root node, giving a short path), it is considered an uncommon point and will have a higher anomaly score.

On the other hand, if a data point needs many cuts to be isolated in most trees (i.e., its leaf is far from the root node, giving a long path), it is considered a common point and will have a lower anomaly score.
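To make the path-length intuition concrete, here is a small sketch using scikit-learn's Isolation Forest, a related tree-ensemble anomaly detector (not SageMaker RCF itself); the data points are made-up assumptions, and the key idea is the same: the far-away point is isolated close to the root, so it is scored as more anomalous.

```python
# Path-length intuition with a related tree-ensemble method (Isolation Forest,
# not SageMaker RCF itself); data values are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense cluster of common points
outlier = np.array([[8.0, 8.0]])                          # far-away point, easy to isolate
X = np.vstack([normal, outlier])

forest = IsolationForest(n_estimators=100, random_state=42).fit(X)

# score_samples returns higher values for common points and lower values for
# anomalies; the outlier should receive the lowest score here.
scores = forest.score_samples(X)
print("outlier score:", scores[-1])
print("typical scores:", scores[:5])
```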

The most important hyperparameters of RCF are num_trees and num_samples_per_tree, which are the number of trees in the forest and the number of samples per tree, respectively.
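For reference only, the following sketch shows how those two hyperparameters might be passed when training RCF with the SageMaker Python SDK; the IAM role ARN, instance settings, output location, and training array are placeholder assumptions, and running it requires valid AWS credentials.

```python
# Sketch of configuring SageMaker RCF hyperparameters (SageMaker Python SDK assumed;
# role ARN, bucket, instance settings, and data are placeholders).
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
train_data = np.random.rand(1000, 3).astype("float32")  # placeholder training matrix

rcf = RandomCutForest(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_trees=100,                 # number of trees in the forest
    num_samples_per_tree=256,      # random subsample assigned to each tree
    output_path=f"s3://{session.default_bucket()}/rcf-output",
    sagemaker_session=session,
)

# record_set converts the NumPy array into the RecordSet format expected by the
# first-party algorithm before launching the training job.
rcf.fit(rcf.record_set(train_data))
```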