
Computing K-Means step by step

In this example, you will simulate K-Means on a very small dataset with only two columns (x and y) and six data points (A, B, C, D, E, and F), as defined in Table 6.8.

Point      x  y
A          1  1
B          2  2
C          5  5
D          5  6
E          1  5
F          2  6
Cluster 1  1  1
Cluster 2  2  2
Cluster 3  5  5

Table 6.8 – Iteration input data for K-Means

Table 6.8 contains three clusters with the following centroids: (1,1), (2,2), (5,5). The number of clusters (3) was defined a priori and the centroid for each cluster was randomly defined. Figure 6.13 shows the stage of the algorithm that you are at right now.

Figure 6.13 – Plotting the K-Means results before completing the first iteration

Here, you can’t see points A, B, and C since they overlap with the cluster centroids, but don’t worry – they will appear soon. Next, you have to compute the distance from each data point to each cluster centroid, and then assign each point to the cluster that is closest to it.

Point  xc1  yc1  xc2  yc2  xc3  yc3  distance-c1  distance-c2  distance-c3  Cluster
A      1    1    2    2    5    5    0.0          1.4          5.7          Cluster 1
B      1    1    2    2    5    5    1.4          0.0          4.2          Cluster 2
C      1    1    2    2    5    5    5.7          4.2          0.0          Cluster 3
D      1    1    2    2    5    5    6.4          5.0          1.0          Cluster 3
E      1    1    2    2    5    5    4.0          3.2          4.0          Cluster 2
F      1    1    2    2    5    5    5.1          4.0          3.2          Cluster 3

Legend: xc1 = x value of cluster 1; yc1 = y value of cluster 1 (and likewise for clusters 2 and 3)

Table 6.9 – Processing iteration 1

Table 6.9 contains the following elements:

  • Each row represents a data point, identified in the Point column.
  • The next six columns contain the centroid coordinates (x and y) of each cluster.
  • The three distance columns contain the distance from that data point to each cluster centroid.
  • The last column shows the cluster that is closest to that data point.

Looking at data point A (first row), you can see that it was assigned to cluster 1 because the distance from data point A to cluster 1 is 0 (do you remember that they were overlapping?). The same calculation happens to all other data points to define a cluster for each data point.
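If you would like to reproduce Table 6.9 programmatically, the following minimal Python sketch computes the distance from each data point to each centroid and picks the closest cluster. The variable names are illustrative, not from the study guide:

import math

points = {"A": (1, 1), "B": (2, 2), "C": (5, 5),
          "D": (5, 6), "E": (1, 5), "F": (2, 6)}
centroids = {"Cluster 1": (1, 1), "Cluster 2": (2, 2), "Cluster 3": (5, 5)}

assignments = {}
for name, point in points.items():
    # Euclidean distance from this data point to every cluster centroid
    distances = {c: math.dist(point, center) for c, center in centroids.items()}
    # Assign the data point to the closest centroid
    assignments[name] = min(distances, key=distances.get)
    print(name, {c: round(d, 1) for c, d in distances.items()}, "->", assignments[name])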

Before you move on, you might want to see how those Euclidean distances between the cluster centroids and the data points were computed. For demonstration purposes, the following simulation will consider the distance from data point A to cluster 3 (the first row in Table 6.9, column distance-c3, value 5.7).

First of all, the following formula was used to calculate the Euclidean distance:

distance = √((x₂ − x₁)² + (y₂ − y₁)²)

Here, you have the following:

  • x₁ = x of data point A = 1
  • y₁ = y of data point A = 1
  • x₂ = x of cluster 3 = 5
  • y₂ = y of cluster 3 = 5

Figure 6.14 applies the formula, step by step.

Figure 6.14 – Computing the Euclidean distance step by step
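Written out, the computation that Figure 6.14 illustrates is:

distance(A, cluster 3) = √((5 − 1)² + (5 − 1)²) = √(16 + 16) = √32 ≈ 5.7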

That is just fantastic, isn’t it? You have almost completed the first iteration of K-Means. In the very last step of iteration 1, you have to refresh the cluster centroids. Remember: initially, they were randomly defined, but now, you have just assigned some data points to each cluster, which means you should be able to identify where the central point of the cluster is.

In this example, the average linkage method will be used to refresh the cluster centroids. This is a very simple step, and the results are presented in Table 6.10.

Point      x    y
A          1    1
B          2    2
C          5    5
D          5    6
E          1    5
F          2    6
Cluster 1  1.0  1.0
Cluster 2  1.5  3.5
Cluster 3  4.0  5.7

Table 6.10 – K-Means results after iteration 1

Table 6.10 shows the same data points (A to F) that you are dealing with (by the way, they will never change), and the centroids of clusters 1, 2, and 3. Those centroids are quite different from what they were initially, as shown in Table 6.8. This is because they were refreshed using average linkage! The method takes the average of the x and y values of all the data points assigned to each cluster. In the next simulation, have a look at how (1.5, 3.5) was obtained as the centroid of cluster 2.

If you look at Table 6.9, you will see that cluster 2 only has two data points assigned to it: B and E (the second and fifth rows of that table). If you take the average of their x values, you will have (2 + 1) / 2 = 1.5, and the average of their y values gives (2 + 5) / 2 = 3.5.
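In code, the centroid refresh is just a per-cluster average. Continuing the illustrative sketch from earlier (points and assignments are the variables computed there):

# Refresh each centroid as the mean of the data points assigned to it
new_centroids = {}
for cluster in ("Cluster 1", "Cluster 2", "Cluster 3"):
    members = [points[name] for name, c in assignments.items() if c == cluster]
    new_centroids[cluster] = (sum(x for x, y in members) / len(members),
                              sum(y for x, y in members) / len(members))

print(new_centroids)  # Cluster 2 becomes (1.5, 3.5); Cluster 3 becomes (4.0, ~5.7)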

With that, you are done with iteration 1 of K-Means and you can view the results in Figure 6.15.

Figure 6.15 – Plotting the K-Means results after the first iteration

Now, you can see almost all the data points, except for data point A because it is still overlapping with the centroid of cluster 1. Moving on, you have to redo the following steps:

  • Recalculate the distance between each data point and each cluster centroid and reassign clusters, if needed.
  • Recalculate the cluster centroids.

You repeat those two tasks until the cluster centroids converge and no longer change, or until you reach the maximum number of allowed iterations, which can be set as a hyperparameter of K-Means. For demonstration purposes, after four iterations, your clusters will look like Figure 6.16.
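Putting the two steps together, a complete (if naive) iteration loop might look like the following sketch. The max_iter value and the convergence test are illustrative assumptions, not code from the study guide:

import math

points = [(1, 1), (2, 2), (5, 5), (5, 6), (1, 5), (2, 6)]
centroids = [(1, 1), (2, 2), (5, 5)]  # randomly chosen initial centroids
max_iter = 10  # hyperparameter: maximum number of allowed iterations

for iteration in range(max_iter):
    # Step 1: assign each data point to its closest centroid
    labels = [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    # Step 2: refresh each centroid as the mean of its assigned points
    new_centroids = []
    for i in range(len(centroids)):
        members = [p for p, label in zip(points, labels) if label == i]
        new_centroids.append((sum(x for x, y in members) / len(members),
                              sum(y for x, y in members) / len(members)))
    # Stop once the centroids converge (they no longer change)
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(labels)     # final cluster index of each data point
print(centroids)  # final centroids

For this toy dataset, no cluster ever ends up empty, so the sketch skips the empty-cluster guard that a production implementation would need.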

Figure 6.16 – Plotting the K-Means results after the fourth iteration

By the fourth iteration, the cluster centroids look stable, and you can clearly see that the data points have been grouped according to their proximity.
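If you would rather verify the result with a library, scikit-learn’s KMeans implements this same loop. A sketch, assuming scikit-learn is installed (the fixed initial centroids mirror Table 6.8):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 2], [5, 5], [5, 6], [1, 5], [2, 6]])
init_centroids = np.array([[1, 1], [2, 2], [5, 5]])

# n_init=1 because the initial centroids are fixed rather than sampled
kmeans = KMeans(n_clusters=3, init=init_centroids, n_init=1, max_iter=10).fit(X)
print(kmeans.labels_)           # cluster assignment of each data point
print(kmeans.cluster_centers_)  # final cluster centroids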