Unsupervised learning – Applying Machine Learning Algorithms – MLS-C01 Study Guide

Unsupervised learning

AWS provides several unsupervised learning algorithms for the following tasks:

  • Clustering: K-Means algorithm
  • Dimension reduction: Principal Component Analysis (PCA)
  • Pattern recognition: IP Insights
  • Anomaly detection: The Random Cut Forest (RCF) algorithm

Let us start by talking about clustering and how the most popular clustering algorithm works: K-Means.

Clustering

Clustering algorithms are very popular in data science. Basically, they aim to identify similar groups in a given dataset, also known as clusters. Clustering algorithms belong to the field of non-supervised learning, which means that they do not need a label or response variable to be trained.

This is just fantastic since labeled data is very scarce! However, it comes with some limitations. The main one is that clustering algorithms provide clusters for you, but not the meaning of each cluster. Thus, someone, as a subject matter expert, has to analyze the properties of each cluster to define their meanings.

There are many types of clustering approaches, such as hierarchical clustering and partitional clustering. Inside each approach, you will find several algorithms. However, K-Means is probably the most popular clustering algorithm, and you are likely to come across it in your exam.

When you are playing with K-Means, somehow, you have to specify the number of clusters that you want to create. Then, you have to allocate the data points across each cluster, so that each data point will belong to a single cluster. This is exactly what you should expect as a result at the end of the clustering process!

You need to specify the number of clusters that you want to create and pass this number to the K-Means algorithm. Then, the algorithm will randomly initiate the central point of each cluster (this is known as centroid initialization).

Once you have the centroids of each cluster, all you need to do is assign a cluster to each data point. To do that, you have to use a proximity or distance metric! This book will use the term distance metric.

The distance metric is responsible for calculating the distance between data points and centroids. The data point will belong to the closer cluster centroid, according to the distance metric.

The most popular distance metric is called Euclidean distance and the math behind it is simple; imagine that the points of your dataset are composed of two dimensions, x and y. So, you could consider points a and b as follows:

  • a (x=1, y=1)
  • b (x=2, y=5)

The Euclidean distance between points a and b is given by the following formula, where x1 and y1 refer to the values of point a, and x2 and y2 refer to the values of point b:

. The same function can be generalized by the following equation:

. Once you have completed this process and assigned a cluster for each data point, you

 methods, such as single link, average link, and complete link.

Due to this centroid refreshment, you will have to keep checking the closest cluster for each data point and keep refreshing the centroids, iteratively, until the cluster centroids converge and no cluster reassignment is needed, or the maximum number of allowed iterations is reached.

Alright, the following is a summarization of the components and steps that compose the K-Means method:

  • Centroid initialization, cluster assignment, centroid refreshment, and then redo the last two steps until it converges
  • A distance metric to assign data points to each cluster (in this case, Euclidian distance)
  • A linkage method to recalculate the cluster centroids (for the sake of our demonstration, you will learn about the average linkage)

With these definitions, you are now ready to walk through the following real example, step by step (some support material is also available for your reference).