Data standardization is another scaling method that transforms the distribution of the data, so that the mean will become 0 and the standard deviation will become 1. Figure 4.4 formally describes this scaling technique, where X represents the value to be transformed, µ refers to the mean of X, and σ is the standard deviation of X:
Figure 4.4 – Standardization formula
Unlike normalization, data standardization will not result in a predefined range of values. Instead, it will transform your data into a standard Gaussian distribution, where your transformed values will represent the number of standard deviations of each value to the mean of the distribution.
The Gaussian distribution, also known as the normal distribution, is one of the most used distributions in statistical models. This is a continuous distribution with two main controlled parameters: µ (mean) and σ (standard deviation). Normal distributions are symmetric around the mean. In other words, most of the values will be close to the mean of the distribution.
Data standardization is often referred to as the zscore and is widely used to identify outliers on your variable, which you will see later in this chapter. For the sake of demonstration, Table 4.7 simulates the data standardization of a small dataset. The input value is shown in the Age column, while the scaled value is shown in the Zscore column:
Age | Mean | Standard deviation | Zscore |
5 | 31,83 | 25,47 | -1,05 |
20 | 31,83 | 25,47 | -0,46 |
24 | 31,83 | 25,47 | -0,31 |
32 | 31,83 | 25,47 | 0,01 |
30 | 31,83 | 25,47 | -0,07 |
80 | 31,83 | 25,47 | 1,89 |
Table 4.7 – Data standardization in action
Make sure you are confident when applying normalization and standardization by hand in the AWS Machine Learning Specialty exam. They might provide a list of values, as well as mean and standard deviation, and ask you for the scaled value of each element in the list.
Binning is a technique where you can group a set of values into a bucket or bin – for example, grouping people between 0 and 14 years old into a bucket named “children,” another group of people between 15 and 18 years old into a bucket named “teenager,” and so on.
Discretization is the process of transforming a continuous variable into discrete or nominal attributes. These continuous values can be discretized by multiple strategies, such as equal-width and equal-frequency.
An equal-width strategy will split your data across multiple bins of the same width. Equal-frequency will split your data across multiple bins with the same number of frequencies.
Look at the following example. Suppose you have the following list containing 16 numbers: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 90. As you can see, this list ranges between 10 and 90. Assuming you want to create four bins using an equal-width strategy, you could come up with the following bins:
In this case, the width of each bin is the same (20 units), but the observations are not equally distributed. Now, the next example simulates an equal-frequency strategy:
In this case, all the bins have the same frequency of observations, although they have been built with different bin widths to make that possible.
Once you have computed your bins, you should be wondering what’s next, right? Here, you have some options:
Take a look at Table 4.8 to understand these approaches using our equal-frequency example:
Ordinal value | Bin | Transforming to a nominal feature | Transforming to an ordinal feature | Removing noise |
10 | Bin >= 10 <= 13 | Bin A | 1 | 11,5 |
11 | Bin >= 10 <= 13 | Bin A | 1 | 11,5 |
12 | Bin >= 10 <= 13 | Bin A | 1 | 11,5 |
13 | Bin >= 10 <= 13 | Bin A | 1 | 11,5 |
14 | Bin > 13 <= 17 | Bin B | 2 | 15,5 |
15 | Bin > 13 <= 17 | Bin B | 2 | 15,5 |
16 | Bin > 13 <= 17 | Bin B | 2 | 15,5 |
17 | Bin > 13 <= 17 | Bin B | 2 | 15,5 |
18 | Bin > 17 <= 21 | Bin C | 3 | 19,5 |
19 | Bin > 17 <= 21 | Bin C | 3 | 19,5 |
20 | Bin > 17 <= 21 | Bin C | 3 | 19,5 |
21 | Bin > 17 <= 21 | Bin C | 3 | 19,5 |
22 | Bin > 21 <= 90 | Bin D | 4 | 55,5 |
23 | Bin > 21 <= 90 | Bin D | 4 | 55,5 |
24 | Bin > 21 <= 90 | Bin D | 4 | 55,5 |
90 | Bin > 21 <= 90 | Bin D | 4 | 55,5 |
Table 4.8 – Different approaches to working with bins and discretization
Again, playing with different binning strategies will give you different results and you should analyze/test the best approach for your dataset. There is no standard answer here – it is all about data exploration!