Important note
In this example, you have only set two dimensions for each data point (dimensions x and y). In real use cases, you can see far more dimensions, and that is why clustering algorithms play a very important role in identifying groups in the data in a more automated fashion.
Hopefully, you have enjoyed how to compute K-Means from scratch! This knowledge will be beneficial for the exam and for your career as a data scientist. By the way, as advised many times, data scientists must be skeptical and curious, so you might be wondering why three clusters were defined in this example and not two or four. You may also be wondering how you measure the quality of the clusters.
You didn’t think this explanation wouldn’t be provided, did you?
Defining the number of clusters and measuring cluster quality
Although K-Means is a great algorithm for finding patterns in your data, it will not provide the meaning of each cluster, nor the number of clusters you have to create to maximize cluster quality.
In clustering, cluster quality means that you want to create groups with a high homogeneity among the elements of the same cluster, and a high heterogeneity among the elements of different clusters. In other words, the elements of the same clusters should be close/similar, whereas the elements of different clusters should be well separated.
One way to compute the cluster’s homogeneity is by using a metric known as the sum of square errors, or SSE for short. This metric will compute the sum of squared differences between each data point and its cluster centroid. For example, when all the data points are located at the same point where the cluster centroid is, then the SSE will be 0. In other words, you want to minimize the SSE. The following equation formally defines the SSE:
Now that you know how to check the cluster quality, it is easier to understand how to define the number of appropriate clusters for a given dataset. All you have to do is find the optimal number of clusters to minimize the SSE. A very popular method that works around that logic is known as the elbow method.
The elbow method proposes executing the clustering algorithm many times. In each execution, you will test a different number of clusters, k. After each execution, you compute the SSE related to that k number of clusters. Finally, you can plot these results and select the number of k where the SSE stops to drastically decrease.
Important note
Adding more clusters will naturally decrease the SSE. In the elbow method, you want to find the point where that change becomes smoother, which means that the addition of new clusters will not bring too much value.
In the previous example, three clusters were created. Figure 6.17 shows the elbow analysis that supports this decision.
Figure 6.17 – The elbow method
You can conclude that adding more than three or four clusters will add unnecessary complexity to the clustering process.
Of course, you should always consider the business background while defining the number of clusters. For example, if you are creating a customer segmentation model and your company has prepared the commercial team and business processes to support four segments of customers, there is no harm in setting up four clusters instead of three.
Finally, you should know that AWS has implemented K-Means as part of its list of built-in algorithms. In other words, you don’t have to use external libraries or bring your own algorithm to play with K-Means on AWS.