Computing K-Means step by step
In this example, you will simulate K-Means on a very small dataset with only two columns (x and y) and six data points (A, B, C, D, E, and F), as defined in Table 6.8.
| Point | x | y |
|---|---|---|
| A | 1 | 1 |
| B | 2 | 2 |
| C | 5 | 5 |
| D | 5 | 6 |
| E | 1 | 5 |
| F | 2 | 6 |
| Cluster 1 | 1 | 1 |
| Cluster 2 | 2 | 2 |
| Cluster 3 | 5 | 5 |

Table 6.8 – Iteration input data for K-Means
Table 6.8 contains three clusters with the following initial centroids: (1,1), (2,2), and (5,5). The number of clusters (3) was defined a priori, and the centroid of each cluster was randomly chosen. Figure 6.13 shows the stage of the algorithm that you are at right now.
Figure 6.13 – Plotting the K-Means results before completing the first iteration
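If you prefer to follow along in code, here is a minimal sketch in Python with NumPy that sets up the same starting state; the `points` and `centroids` variable names are just illustrative choices for this walkthrough, not part of any particular library:

```python
import numpy as np

# The six data points (A, B, C, D, E, and F) from Table 6.8
points = np.array([
    [1, 1],  # A
    [2, 2],  # B
    [5, 5],  # C
    [5, 6],  # D
    [1, 5],  # E
    [2, 6],  # F
], dtype=float)

# The three randomly defined initial centroids from Table 6.8
centroids = np.array([
    [1, 1],  # Cluster 1
    [2, 2],  # Cluster 2
    [5, 5],  # Cluster 3
], dtype=float)
```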
Here, you can’t see points A, B, and C since they overlap with the cluster centroids, but don’t worry – they will appear soon. Next, you have to compute the distance from each data point to each cluster centroid, and then assign each point to the closest cluster.
| Point | xc1 | yc1 | xc2 | yc2 | xc3 | yc3 | distance-c1 | distance-c2 | distance-c3 | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 2 | 2 | 5 | 5 | 0.0 | 1.4 | 5.7 | Cluster 1 |
| B | 1 | 1 | 2 | 2 | 5 | 5 | 1.4 | 0.0 | 4.2 | Cluster 2 |
| C | 1 | 1 | 2 | 2 | 5 | 5 | 5.7 | 4.2 | 0.0 | Cluster 3 |
| D | 1 | 1 | 2 | 2 | 5 | 5 | 6.4 | 5.0 | 1.0 | Cluster 3 |
| E | 1 | 1 | 2 | 2 | 5 | 5 | 4.0 | 3.2 | 4.0 | Cluster 2 |
| F | 1 | 1 | 2 | 2 | 5 | 5 | 5.1 | 4.0 | 3.2 | Cluster 3 |

Legend: xc1 and yc1 are the x and y coordinates of the cluster 1 centroid (likewise for xc2/yc2 and xc3/yc3); distance-c1, distance-c2, and distance-c3 are the Euclidean distances from each data point to each centroid.
Table 6.9 – Processing iteration 1
Table 6.9 contains the following elements:

- The coordinates of each cluster centroid (xc1/yc1, xc2/yc2, and xc3/yc3)
- The Euclidean distance from each data point to each centroid (distance-c1, distance-c2, and distance-c3)
- The cluster each data point was assigned to (the one with the smallest distance)

Looking at data point A (first row), you can see that it was assigned to cluster 1 because the distance from data point A to the cluster 1 centroid is 0 (do you remember that they were overlapping?). The same calculation happens for all the other data points to define a cluster for each of them.
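To reproduce Table 6.9 programmatically, the following continuation of the earlier sketch computes the distance from every point to every centroid and picks the closest cluster for each point (assuming the `points` and `centroids` arrays defined before):

```python
# Assignment step: Euclidean distance from every point to every centroid,
# using broadcasting to build a 6 x 3 matrix of distances
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster whose centroid is closest
assignments = distances.argmin(axis=1)

print(np.round(distances, 1))  # matches the distance-c1..c3 columns of Table 6.9
print(assignments + 1)         # [1 2 3 3 2 3], that is, the Cluster column
```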
Before you move on, you might want to see how those Euclidean distances between the cluster centroids and the data points were computed. For demonstration purposes, the following simulation will consider the distance from data point A to cluster 3 (the first row in Table 6.9, column distance-c3, value 5.7).
First of all, the following formula was used to calculate the Euclidean distance:

$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$

Here, you have the following:

- (x1, y1) are the coordinates of the data point
- (x2, y2) are the coordinates of the cluster centroid
- d is the Euclidean distance between them
Figure 6.14 applies the formula, step by step.
Figure 6.14 – Computing the Euclidian distance step by step
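In case Figure 6.14 is not in front of you, the same calculation can be written out in one line. Substituting data point A = (1, 1) for (x1, y1) and the cluster 3 centroid = (5, 5) for (x2, y2) gives the following:

$$ d = \sqrt{(5 - 1)^2 + (5 - 1)^2} = \sqrt{16 + 16} = \sqrt{32} \approx 5.7 $$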
That is just fantastic, isn’t it? You have almost completed the first iteration of K-Means. In the very last step of iteration 1, you have to refresh the cluster centroids. Remember: initially, they were randomly defined, but now, you have just assigned some data points to each cluster, which means you should be able to identify where the central point of the cluster is.
In this example, the average linkage method will be used to refresh the cluster centroids; in other words, each centroid becomes the average of the data points assigned to its cluster. This is a very simple step, and the results are presented in Table 6.10.
| Point | x | y |
|---|---|---|
| A | 1 | 1 |
| B | 2 | 2 |
| C | 5 | 5 |
| D | 5 | 6 |
| E | 1 | 5 |
| F | 2 | 6 |
| Cluster 1 | 1 | 1 |
| Cluster 2 | 1.5 | 3.5 |
| Cluster 3 | 4 | 5.7 |

Table 6.10 – K-Means results after iteration 1
Table 6.10 shows the same data points (A to F) that you have been dealing with (by the way, they will never change), along with the refreshed centroids of clusters 1, 2, and 3. Those centroids are quite different from their initial values in Table 6.8. This is because they were refreshed using average linkage! The method takes the average of the x and y values of the data points belonging to each cluster. In the next simulation, have a look at how (1.5, 3.5) was obtained as the centroid of cluster 2.
If you look at Table 6.9, you will see that cluster 2 only has two data points assigned to it: B and E (the second and fifth rows in that table). Averaging their x values gives (2 + 1) / 2 = 1.5, and averaging their y values gives (2 + 5) / 2 = 3.5.
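The same averaging can be expressed as a continuation of the earlier Python sketch; this recomputes every centroid as the mean of its assigned points (the variable names are still just illustrative):

```python
# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([
    points[assignments == k].mean(axis=0)
    for k in range(len(centroids))
])

print(np.round(new_centroids, 1))
# [[1.  1. ]  <- cluster 1: only point A
#  [1.5 3.5]  <- cluster 2: mean of B and E
#  [4.  5.7]] <- cluster 3: mean of C, D, and F
```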
With that, you are done with iteration 1 of K-Means and you can view the results in Figure 6.15.
Figure 6.15 – Plotting the K-Means results after the first iteration
Now, you can see almost all the data points, except for data point A, because it is still overlapping with the centroid of cluster 1. Moving on, you have to redo the following steps:

1. Compute the distance from each data point to each cluster centroid and assign each point to the closest cluster.
2. Refresh the cluster centroids based on the new assignments.
You repeat those two tasks until the cluster centroids converge and don’t change anymore, or until you reach the maximum number of allowed iterations, which can be set as a hyperparameter of K-Means. For demonstration purposes, after four iterations, your clusters will look like Figure 6.16.
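Putting the two tasks together, a minimal end-to-end loop might look like the following sketch; `max_iterations` is an illustrative stand-in for the hyperparameter mentioned above, and this toy version assumes no cluster ever ends up empty:

```python
# Repeat assignment and update until the centroids stop moving,
# or until the maximum number of iterations is reached
max_iterations = 10
for iteration in range(max_iterations):
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignments = distances.argmin(axis=1)
    new_centroids = np.array([
        points[assignments == k].mean(axis=0)
        for k in range(len(centroids))
    ])
    if np.allclose(new_centroids, centroids):  # converged: centroids unchanged
        break
    centroids = new_centroids
```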
Figure 6.16 – Plotting the K-Means results after the fourth iteration
By the fourth iteration, all the cluster centroids look pretty stable, and you can clearly see that the data points have been grouped according to their proximity.