You will learn about these algorithms in more detail in the later chapters of this book. For instance, entropy and information gain are two metrics used by decision trees to measure feature importance. Knowing the predictive power of each feature helps the algorithm define the optimal root, intermediate, and leaf nodes of the tree.
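Although the details are deferred to later chapters, a minimal sketch of these two metrics may help fix the intuition. The labels and the split below are made up purely for illustration:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label distribution, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 0, 0, 0])  # 50/50 class split: entropy = 1.0
left, right = parent[:3], parent[3:]   # a perfect split produced by some feature

# Information gain = parent entropy minus the weighted entropy of the children
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(gain)  # 1.0 -> this hypothetical feature separates the classes perfectly

A feature whose split yields a higher information gain is a better candidate for a node near the root of the tree.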
Take a moment to work through the following example to understand why data normalization helps these types of algorithms. You already know that the goal of a clustering algorithm is to find groups, or clusters, in your data, and one of the most widely used clustering algorithms is k-means.
Figure 4.2 shows how different variable scales can change the projection of the data in the hyperplane that k-means clustering operates on:

Figure 4.2 – Plotting data of different scales in a hyperplane

On the left-hand side of Figure 4.2, you can see a single data point plotted in a hyperplane with three dimensions (x, y, and z). All three dimensions (also known as features) were normalized to a range between 0 and 1. On the right-hand side, you can see the same data point, but this time, the x dimension was not normalized. You can clearly see that the hyperplane has changed.
In a real scenario, you would have far more dimensions and data points. A difference in the scale of the data would shift the centroids of each cluster and could potentially change the cluster assignments of some points. The same problem occurs with other algorithms that rely on distance calculations, such as k-nearest neighbors (KNN).
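The following minimal sketch (with made-up numbers) shows how an unscaled feature can dominate a Euclidean distance calculation and even flip which point is considered the nearest neighbor:

import numpy as np

# Three points with two features: one on a large scale, one between 0 and 1
a = np.array([50_000.0, 0.90])
b = np.array([50_200.0, 0.10])  # close to `a` on the large feature only
c = np.array([49_000.0, 0.88])  # close to `a` on the small feature

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

# Without normalization, the large-scale feature dominates the distance:
print(euclidean(a, b))  # ~200  -> b looks like the nearest neighbor of a
print(euclidean(a, c))  # ~1000

# Min-max normalize each feature (column) to [0, 1] across the three points:
X = np.vstack([a, b, c])
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
a_n, b_n, c_n = X_norm

print(euclidean(a_n, b_n))  # ~1.01 -> now b is far from a
print(euclidean(a_n, c_n))  # ~0.83 -> c becomes the nearest neighbor

Since k-means and KNN both rely on exactly this kind of distance computation, the same scale sensitivity applies to both of them.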
Other algorithms, such as neural networks and linear regression, compute weighted sums over your input data. Usually, these types of algorithms perform operations such as W1*X1 + W2*X2 + ... + Wn*Xn, where Xi and Wi refer to a particular feature value and its weight, respectively. Again, you will learn the details of neural networks and linear models later, but can you see the data scaling problem just by looking at this calculation? The sum can easily produce very large values if X (the feature) and W (the weight) are large numbers, which makes the algorithm's optimization much harder. In neural networks, this problem is known as exploding gradients.
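A quick sketch with made-up numbers illustrates the point. The same weights applied to raw versus min-max-scaled features produce sums of very different magnitudes:

import numpy as np

rng = np.random.default_rng(42)
w = rng.normal(size=3)  # example weights

x_raw = np.array([52_000.0, 0.4, 310.0])  # features on very different scales
x_norm = np.array([0.61, 0.40, 0.27])     # the same features after min-max scaling

print(np.dot(w, x_raw))   # magnitude dominated almost entirely by the 52,000 value
print(np.dot(w, x_norm))  # a small, well-behaved value

Keeping the inputs on a small, consistent scale keeps the weighted sums, and therefore the gradients computed during training, well behaved.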
You now have a very good understanding of the reasons why you should apply data normalization (and when you should not). Data normalization is often implemented in ML libraries as a min-max scaler. If you find this term in the exam, remember that it is the same as data normalization.
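For instance, scikit-learn exposes this technique as MinMaxScaler. A minimal usage sketch, with made-up data, looks as follows:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50_000.0, 0.90],
              [50_200.0, 0.10],
              [49_000.0, 0.88]])

# The default feature_range is (0, 1); fit() learns each column's min and max
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies between 0 and 1

# In practice, fit the scaler on the training set only, then reuse it to
# transform validation/test data, so no information leaks from those sets.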
Additionally, data normalization does not necessarily need to transform your feature into a range between 0 and 1. In reality, you can transform the feature into any range you want. Figure 4.3 shows how normalization is formally defined.
Figure 4.3 – Normalization formula
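Written out in the notation used throughout this chapter, the standard min-max definition is as follows; the second form generalizes it to an arbitrary target range [a, b]:

Xnorm = (X - Xmin) / (Xmax - Xmin)

X' = a + ((X - Xmin) * (b - a)) / (Xmax - Xmin)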
Here, Xmin and Xmax are the minimum and maximum values of the feature (the lower and upper bounds of its range), and X is the feature value being transformed. Apart from data normalization, there is another very important numerical transformation technique that you must be aware of, not only for the exam but also for your data science career. You'll look at it in the next section.