
Data standardization

Data standardization is another scaling method that transforms the distribution of the data so that the mean becomes 0 and the standard deviation becomes 1. Figure 4.4 formally describes this scaling technique, where X represents the value to be transformed, µ refers to the mean of X, and σ is the standard deviation of X:

z = (X - µ) / σ

Figure 4.4 – Standardization formula

Unlike normalization, data standardization does not result in a predefined range of values. Instead, it rescales your data to have a mean of 0 and a standard deviation of 1 (if the original data follows a Gaussian distribution, the result is a standard Gaussian distribution), where each transformed value represents the number of standard deviations that value lies from the mean of the distribution.

Important note

The Gaussian distribution, also known as the normal distribution, is one of the most widely used distributions in statistical models. It is a continuous distribution with two main parameters: µ (mean) and σ (standard deviation). Normal distributions are symmetric around the mean, with most of the values concentrated close to it.

Data standardization is often referred to as the z-score, and it is widely used to identify outliers in your variables, which you will see later in this chapter. For the sake of demonstration, Table 4.7 simulates the data standardization of a small dataset. The input value is shown in the Age column, while the scaled value is shown in the Z-score column:

Age    Mean     Standard deviation    Z-score
5      31.83    25.47                 -1.05
20     31.83    25.47                 -0.46
24     31.83    25.47                 -0.31
32     31.83    25.47                  0.01
30     31.83    25.47                 -0.07
80     31.83    25.47                  1.89

Table 4.7 – Data standardization in action

Make sure you are confident applying normalization and standardization by hand for the AWS Machine Learning Specialty exam. You might be given a list of values, along with their mean and standard deviation, and asked for the scaled value of each element in the list.
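If you want to double-check your hand calculations, the following minimal Python sketch (the variable names are illustrative) reproduces the Z-score column of Table 4.7 with numpy. One assumption worth flagging: the table's standard deviation of 25.47 is the sample standard deviation (ddof=1); a scaler that uses the population standard deviation (ddof=0) would produce slightly different values.

import numpy as np

# Ages from Table 4.7
ages = np.array([5, 20, 24, 32, 30, 80], dtype=float)

mean = ages.mean()        # 31.83
std = ages.std(ddof=1)    # 25.47 (sample standard deviation)

# z = (X - mean) / std, applied element-wise
z_scores = (ages - mean) / std
print(np.round(z_scores, 2))  # [-1.05 -0.46 -0.31  0.01 -0.07  1.89]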

Applying binning and discretization

Binning is a technique where you group a set of values into a bucket or bin – for example, grouping people between 0 and 14 years old into a bucket named “children,” people between 15 and 18 years old into a bucket named “teenagers,” and so on.

Discretization is the process of transforming a continuous variable into discrete or nominal attributes. These continuous values can be discretized by multiple strategies, such as equal-width and equal-frequency.

An equal-width strategy splits your data across multiple bins of the same width, while an equal-frequency strategy splits your data across multiple bins containing the same number of observations.

Look at the following example. Suppose you have the following list containing 16 numbers: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 90. As you can see, this list ranges between 10 and 90. Assuming you want to create four bins using an equal-width strategy, you could come up with the following bins:

  • Bin >= 10 <= 30: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
  • Bin > 30 <= 50: (no observations)
  • Bin > 50 <= 70: (no observations)
  • Bin > 70 <= 90: 90

In this case, the width of each bin is the same (20 units), but the observations are not equally distributed. Now, the next example simulates an equal-frequency strategy:

  • Bin >= 10 <= 13: 10, 11, 12, 13
  • Bin > 13 <= 17: 14, 15, 16, 17
  • Bin > 17 <= 21: 18, 19, 20, 21
  • Bin > 21 <= 90: 22, 23, 24, 90

In this case, all the bins have the same frequency of observations, although they have been built with different bin widths to make that possible.
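Both strategies are straightforward to reproduce with pandas; the sketch below (variable names are illustrative) uses pd.cut for equal-width binning and pd.qcut for equal-frequency binning. Keep in mind that qcut derives its edges from quantiles, so they come out as 13.75, 17.5, and 21.25 rather than the hand-picked 13, 17, and 21 above (the counts per bin are still equal).

import pandas as pd

values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17,
                    18, 19, 20, 21, 22, 23, 24, 90])

# Equal-width: four bins of the same width (20 units each)
equal_width = pd.cut(values, bins=4)
print(equal_width.value_counts().sort_index())

# Equal-frequency: four bins with the same number of observations
equal_frequency = pd.qcut(values, q=4)
print(equal_frequency.value_counts().sort_index())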

Once you have computed your bins, you might be wondering what’s next. Here, you have some options:

  • You can name your bins and use them as a nominal feature in your model! Of course, as a nominal variable, you should consider applying one-hot encoding before feeding this data to an ML model.
  • You might want to order your bins and use them as an ordinal feature.
  • Maybe you want to remove some noise from your feature by averaging the minimum and maximum values that fall into each bin and using that average as your transformed feature.

Take a look at Table 4.8 to understand these approaches using our equal-frequency example; a short code sketch after the table shows one way to build each column:

Original value   Bin               Nominal feature   Ordinal feature   Removing noise
10               Bin >= 10 <= 13   Bin A             1                 11.5
11               Bin >= 10 <= 13   Bin A             1                 11.5
12               Bin >= 10 <= 13   Bin A             1                 11.5
13               Bin >= 10 <= 13   Bin A             1                 11.5
14               Bin > 13 <= 17    Bin B             2                 15.5
15               Bin > 13 <= 17    Bin B             2                 15.5
16               Bin > 13 <= 17    Bin B             2                 15.5
17               Bin > 13 <= 17    Bin B             2                 15.5
18               Bin > 17 <= 21    Bin C             3                 19.5
19               Bin > 17 <= 21    Bin C             3                 19.5
20               Bin > 17 <= 21    Bin C             3                 19.5
21               Bin > 17 <= 21    Bin C             3                 19.5
22               Bin > 21 <= 90    Bin D             4                 56.0
23               Bin > 21 <= 90    Bin D             4                 56.0
24               Bin > 21 <= 90    Bin D             4                 56.0
90               Bin > 21 <= 90    Bin D             4                 56.0

Table 4.8 – Different approaches to working with bins and discretization
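The following sketch shows one way (among several) to build the three transformed columns of Table 4.8 with pandas; the column names and bin labels are illustrative, not a prescribed API. Note that averaging the minimum and maximum observed values in the last bin gives (22 + 90) / 2 = 56.0.

import pandas as pd

values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17,
                    18, 19, 20, 21, 22, 23, 24, 90])

# Equal-frequency bins with the same edges as Table 4.8
df = pd.DataFrame({"value": values})
df["nominal"] = pd.cut(values, bins=[9, 13, 17, 21, 90],
                       labels=["Bin A", "Bin B", "Bin C", "Bin D"])

# Ordinal feature: an integer code that preserves the bin order
df["ordinal"] = df["nominal"].cat.codes + 1

# Removing noise: average of the minimum and maximum values in each bin
df["denoised"] = df.groupby("nominal")["value"].transform(
    lambda s: (s.min() + s.max()) / 2)

# As a nominal variable, the bins can also be one-hot encoded
one_hot = pd.get_dummies(df["nominal"])
print(df)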

Again, playing with different binning strategies will give you different results, and you should analyze and test which approach works best for your dataset. There is no standard answer here – it is all about data exploration!