Normalization and standardization rely on your training data to fit their parameters: the minimum and maximum values in the case of normalization, and the mean and standard deviation in the case of standard scaling. This also means you must fit those parameters using only your training data and never the testing data.
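The following minimal sketch (using scikit-learn, with made-up numbers) shows this idea in practice: both scalers are fitted on the training split only and then applied to the test split:

```python
# A minimal sketch of fitting scaling parameters on the training split only.
# The numbers here are illustrative, not taken from any real dataset.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[10.0], [1_000.0], [10_000.0]])
X_test = np.array([[500.0], [100_000.0]])

# Fit min/max (normalization) and mean/std (standardization) on training data only
normalizer = MinMaxScaler().fit(X_train)
standardizer = StandardScaler().fit(X_train)

# The test set is only transformed, never used for fitting
X_test_norm = normalizer.transform(X_test)
X_test_std = standardizer.transform(X_test)
```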
However, there are other types of numerical transformations that do not need any parameters fitted from the training data; they rely purely on mathematical computations. One of these is the logarithmic transformation. This is a very common type of transformation in ML models and is especially beneficial for skewed features. If you don’t know what a skewed distribution is, take a look at Figure 4.5.
Figure 4.5 – Skewed distributions
In the middle, you have a normal distribution (or Gaussian distribution). On the left- and right-hand sides, you have skewed distributions. In a skewed feature, some values lie far away from the mean in a single direction (either to the left or to the right). Those values pull both the mean and the median of the distribution in that same direction, toward the long tail you can see in Figure 4.5.
One very clear example of data that tends to be skewed is the annual salary of a particular group of professionals in a given region, such as senior data scientists working in Florida, US. Most values of this type of variable sit close to one another (because most people earn something close to the average salary), with just a few very high values (because a small group of people makes much more money than the others).
Hopefully, you can now easily see why the mean and median move toward the tail: the big salaries pull them in that direction.
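If you want to check this numerically, here is a minimal sketch with made-up salary figures (not real data), showing how a single very high salary pulls the mean far toward the right tail, and the median to a lesser extent:

```python
import numpy as np

# Hypothetical salaries: five typical values and one very high one
salaries = np.array([90_000, 95_000, 100_000, 105_000, 110_000, 500_000])

print(np.median(salaries))  # 102500.0 -> still close to the bulk of the values
print(np.mean(salaries))    # ~166666.67 -> pulled far toward the long tail
```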
Alright, but why is a logarithmic transformation beneficial for this type of feature? The answer lies in the math behind it:
Figure 4.6 – Logarithmic properties
The logarithm is the inverse of the exponential function. A log transformation will therefore reduce the scale of your numbers according to a given base (such as base 2, base 10, or base e in the case of the natural logarithm). Looking at the salary distribution from the previous example, you would bring all those numbers down, and the higher the number, the greater the reduction; this compression happens on a log scale rather than a linear one. This behavior compresses the outliers of the distribution (making it closer to a normal distribution), which is beneficial for many ML algorithms, such as linear regression. Table 4.9 shows some of the differences between transforming a number on a linear scale and on a log scale:
| Original value | Linear scale (normalization) | Log scale (base 10) |
| --- | --- | --- |
| 10 | 0.0001 | 1 |
| 1,000 | 0.01 | 3 |
| 10,000 | 0.1 | 4 |
| 100,000 | 1 | 5 |
Table 4.9 – Differences between linear transformation and log transformation
As you can see, the linear transformation kept the original magnitude of the data (you can still see the outliers, just on another scale), while the log transformation reduced those differences in magnitude and still kept the order of the values.
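As a quick check, the following minimal sketch reproduces the numbers from Table 4.9 with NumPy, dividing by the maximum value for the linear column (as the table does) and taking the base-10 logarithm for the log column:

```python
import numpy as np

values = np.array([10.0, 1_000.0, 10_000.0, 100_000.0])

linear_scale = values / values.max()   # 0.0001, 0.01, 0.1, 1.0
log_scale = np.log10(values)           # 1.0, 3.0, 4.0, 5.0

print(linear_scale)
print(log_scale)
```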
Can you think of another mathematical transformation that behaves like the log (making the distribution closer to Gaussian)? OK, here is another one: the square root. Take the square root of the numbers shown in Table 4.9 and see for yourself!
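Here is the same check for the square root, which also compresses the large values, although less aggressively than the base-10 logarithm:

```python
import numpy as np

values = np.array([10.0, 1_000.0, 10_000.0, 100_000.0])
print(np.sqrt(values))  # ~3.16, ~31.62, 100.0, ~316.23
```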
Now, pay attention to this: both the log and the square root belong to a family of transformations known as power transformations, and there is a very popular method, which is likely to be mentioned on your AWS exam, that can perform a range of power transformations like the ones you have just seen. This method was proposed by George Box and David Cox, and it is called Box-Cox.
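If you want to experiment with Box-Cox yourself, the following minimal sketch uses scikit-learn's PowerTransformer (scipy.stats.boxcox is another common option). The salary figures are made up, and note that Box-Cox requires strictly positive values:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical, strictly positive salary values (Box-Cox requires positive data)
salaries = np.array([[90_000.0], [95_000.0], [100_000.0],
                     [105_000.0], [110_000.0], [500_000.0]])

# The transformer estimates the power parameter (lambda) that makes the
# feature as close to Gaussian as possible
pt = PowerTransformer(method='box-cox', standardize=False)
salaries_bc = pt.fit_transform(salaries)

print(pt.lambdas_)          # estimated lambda for the feature
print(salaries_bc.ravel())  # transformed, much less skewed values
```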