Important note 4 – Data Preparation and Transformation – MLS-C01 Study Guide

Important note

In the case of categorical variables, you can replace the missing data with the most frequent value in your dataset (the mode). The same logic of grouping the dataset according to specific features still applies.
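The following is a minimal sketch of this idea, not a definitive implementation. The helper impute_with_mode and the country and payment columns are hypothetical names chosen for illustration; missing values are assumed to be represented as None.

```python
from collections import Counter, defaultdict

def impute_with_mode(rows, column, group_by=None):
    """Fill missing (None) values in `column` with the most frequent value,
    computed over the whole dataset or per group when `group_by` is given."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[column] is not None:
            key = row[group_by] if group_by else None
            counts[key][row[column]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[column] is None:
            key = row[group_by] if group_by else None
            if counts[key]:  # leave the gap if the group has no known values
                new_row[column] = counts[key].most_common(1)[0][0]
        filled.append(new_row)
    return filled

# Hypothetical data: payment method is missing in two rows.
rows = [
    {"country": "BR", "payment": "card"},
    {"country": "BR", "payment": "card"},
    {"country": "BR", "payment": "card"},
    {"country": "BR", "payment": "cash"},
    {"country": "BR", "payment": None},
    {"country": "US", "payment": "cash"},
    {"country": "US", "payment": "cash"},
    {"country": "US", "payment": "card"},
    {"country": "US", "payment": None},
]

filled_global = impute_with_mode(rows, "payment")
filled_grouped = impute_with_mode(rows, "payment", group_by="country")
```

With the global mode, both gaps are filled with "card". With grouping by country, the missing value in the US rows is filled with "cash" instead, which is the most frequent value within that group.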

You can also use more sophisticated methods of imputation, including constructing an ML model to predict the value of your missing data. The downside of these imputation approaches (either by averaging or predicting the value) is that you are making inferences about the data that are not necessarily correct and that can add bias to the dataset.

To sum this up, the trade-off while dealing with missing data is a balance between losing data and adding bias to the dataset. Unfortunately, there is no scientific recipe that works for every problem. To decide on what you are going to do, you must look at your success criteria, explore your data, run experiments, and then make your decisions.

You will now move to another headache for many ML algorithms: outliers.

Dealing with outliers

You are not on this studying journey just to pass the AWS Machine Learning Specialty exam but also to become a better data scientist. There are many different ways to look at the outlier problem purely from a mathematical perspective; however, the datasets used in real life are derived from the underlying business process, so you must include a business perspective during an outlier analysis.

An outlier is an atypical data point in a set of data. For example, Figure 4.8 shows some data points that have been plotted in a two-dimensional plane; that is, x and y. The red point is an outlier since it is an atypical value in this series of data.

Figure 4.8 – Identifying an outlier

It is important to treat outlier values because some statistical methods are impacted by them. You can see this behavior in action in Figure 4.8. On the left-hand side, a line has been drawn that best fits the data points while ignoring the red point. On the right-hand side, the line was fitted again, this time including the red point.

You can visually conclude that, by ignoring the outlier point, you get a better solution in the left-hand plot of the preceding chart, since the line passes closer to most of the values. You can also prove this by computing an associated error for each line (which you will learn later in this book).
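As a rough illustration of this effect, the sketch below fits a least-squares line with and without a single outlier and measures the mean squared error of each line on the typical points only. The data points and the helper names fit_line and mean_squared_error are hypothetical; they are not the values plotted in Figure 4.8.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def mean_squared_error(points, slope, intercept):
    """Average squared vertical distance between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

# Hypothetical data: five points on the line y = 2x, plus one atypical point.
typical = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
outlier = (6, 30)

slope_without, intercept_without = fit_line(typical)
slope_with, intercept_with = fit_line(typical + [outlier])

# Error of each line measured on the typical points only.
mse_without = mean_squared_error(typical, slope_without, intercept_without)
mse_with = mean_squared_error(typical, slope_with, intercept_with)
```

In this toy case, the line fitted without the outlier recovers a slope of 2, while the single atypical point more than doubles the slope and increases the error on the remaining points.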

It is worth remembering that you have already seen the outlier issue in action in another situation in this book: specifically, in Table 4.10, while dealing with missing values. In that example, the median was used to work around the problem. Feel free to go back and read it again, but what should be very clear at this point is that median values are less impacted by outliers than average (mean) values.
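A quick way to convince yourself of this is to compare both statistics before and after adding an extreme value. The ages below are hypothetical and are not the values from Table 4.10.

```python
from statistics import mean, median

# Hypothetical ages, then the same list with one clearly atypical value added.
ages = [22, 25, 27, 30, 31]
ages_with_outlier = ages + [250]

mean_before = mean(ages)
median_before = median(ages)
mean_after = mean(ages_with_outlier)
median_after = median(ages_with_outlier)

# How far each statistic moved because of the single outlier.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before
```

Here the mean jumps from 27 to roughly 64, while the median only moves from 27 to 28.5.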

You now know what outliers are and why you should treat them. You should always consider your business perspective while dealing with outliers, but there are mathematical methods to find them. Now, you are ready to move on and look at some methods for outlier detection.

You have already learned about the most common method: zscore. In Table 4.7, you saw a table containing a set of ages. Refer to it again to refresh your memory. In the last column of that table, the zscore of each age was computed according to the equation shown in Figure 4.4.

There is no well-defined range for those zscore values; however, in a normal distribution without outliers, they will mostly range between -3 and 3. Remember: the zscore gives you the number of standard deviations that a value lies from the mean of the distribution. Figure 4.9 shows some of the properties of a normal distribution:

Figure 4.9 – Normal distribution properties. Image adapted from https://pt.wikipedia.org/wiki/Ficheiro:The_Normal_Distribution.svg

According to the normal distribution properties, about 95% of values fall within 2 standard deviations of the mean (between -2 and 2), while about 99.7% fall within 3 standard deviations (between -3 and 3). Coming back to the outlier detection context, you can set thresholds on top of those zscore values to specify whether a data point is an outlier or not!
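The shares quoted above are rounded. If you want to check them, a small sketch using the standard library's NormalDist class can compute the share of a standard normal distribution that lies within k standard deviations of the mean. The helper name coverage is hypothetical.

```python
from statistics import NormalDist

def coverage(k):
    """Share of a standard normal distribution within k standard deviations."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1 = coverage(1)  # roughly 0.68
within_2 = coverage(2)  # roughly 0.95
within_3 = coverage(3)  # roughly 0.997
```

These figures only hold when the variable is approximately normally distributed, which is worth checking before relying on zscore thresholds.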

There is no standard threshold that you can use to classify outliers. Ideally, you should look at your data and see what makes more sense for you. Usually (this is not a rule), you will use some number between 2 and 3 standard deviations from the mean to flag outliers, since roughly 5% or less of normally distributed data will be selected by this rule (again, this is just a reference threshold, so that you can select some data for further scrutiny). Remember that outliers can appear both below and above the mean value of the distribution, as shown in Table 4.12, where the outliers were flagged with an absolute zscore greater than 3 (the value column is hidden for the sake of this demonstration).

Value      Zscore   Is outlier?
(hidden)    1.3     NO
(hidden)    0.8     NO
(hidden)    3.1     YES
(hidden)   -2.9     NO
(hidden)   -3.5     YES
(hidden)    1.0     NO
(hidden)    1.1     NO

Table 4.12 – Flagging outliers according to the zscore value
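The flagging rule used in Table 4.12 can be sketched as follows. The helper names zscores and flag_by_zscore are hypothetical, and the sketch assumes the population standard deviation (statistics.pstdev); the equation in Figure 4.4 may use the sample standard deviation instead, which would change the values slightly.

```python
from statistics import mean, pstdev

def zscores(values):
    """Number of standard deviations each value lies from the mean.
    Assumption: population standard deviation is used."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def flag_by_zscore(z_values, threshold=3.0):
    """True when the absolute zscore exceeds the threshold."""
    return [abs(z) > threshold for z in z_values]

# The zscore column of Table 4.12, flagged with a threshold of 3.
table_z = [1.3, 0.8, 3.1, -2.9, -3.5, 1.0, 1.1]
table_flags = flag_by_zscore(table_z)

# Hypothetical raw values, to show the zscore computation itself.
sample = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
sample_z = zscores(sample)
```

Taking the absolute value is what lets a single threshold catch outliers on both sides of the mean, such as 3.1 and -3.5 in the table.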

Two outliers were found in Table 4.12: row number three and row number five.

Another way to find outliers in the data is by applying the box plot logic. When you look at a numerical variable, it is possible to extract many descriptive statistics from it, not only the mean, median, minimum, and maximum values, as you have seen previously. Another property that’s present in data distributions is known as quantiles.

Quantiles are cut-off points that are established at regular intervals from the cumulative distribution function of a random variable. The q-quantiles are the cut-off points that divide the data into q intervals holding nearly the same number of observations, and they receive special names in some situations:

  • The 4-quantiles are called quartiles.
  • The 10-quantiles are called deciles.
  • The 100-quantiles are called percentiles.

For example, the 20th percentile (the 20th cut-off point of the 100-quantiles) specifies that 20% of the data is below that point. In a box plot, you can use the 4-quantiles (also known as quartiles) to expose the distribution of the data (Q1 and Q3), as shown in Figure 4.10.

Figure 4.10 – Box plot definition

Q1 is also known as the lower quartile or 25th percentile, and this means that 25% of the data is below that point in the distribution. Q3 is also known as the upper quartile or 75th percentile, and this means that 75% of the data is below that point in the distribution.
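To make quartiles, deciles, and percentiles concrete, here is a minimal sketch using statistics.quantiles from the Python standard library. The sample is hypothetical (the integers 1 to 99), and the function's default interpolation method is used; other tools may interpolate slightly differently.

```python
from statistics import quantiles

# Hypothetical sample: the integers 1..99.
data = list(range(1, 100))

quartiles = quantiles(data, n=4)      # 3 cut-off points: Q1, Q2 (median), Q3
deciles = quantiles(data, n=10)       # 9 cut-off points
percentiles = quantiles(data, n=100)  # 99 cut-off points

q1, q2, q3 = quartiles
p20 = percentiles[19]  # the 20th percentile is the 20th cut-off point
```

Note that dividing the data into q intervals always yields q - 1 cut-off points, which is why quartiles has three elements and percentiles has 99.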

Subtracting Q1 from Q3 gives you the interquartile range (IQR) value, which you can then use to compute the limits of the box plot (commonly Q1 - 1.5 x IQR and Q3 + 1.5 x IQR), shown by the “minimum” and “maximum” labels in the preceding diagram.

Finally, anything below the “minimum” value or above the “maximum” value of the box plot will be flagged as an outlier.
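The box plot logic can be sketched as follows. The 1.5 multiplier is the usual box plot convention and is an assumption here, as are the helper name box_plot_limits and the sample data; quartiles are computed with the default method of statistics.quantiles.

```python
from statistics import quantiles

def box_plot_limits(values, multiplier=1.5):
    """Return the (minimum, maximum) limits of a box plot.
    Assumption: limits are Q1 - multiplier * IQR and Q3 + multiplier * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical sample: the integers 1..99 plus one atypical value.
values = list(range(1, 100)) + [300]

lower_limit, upper_limit = box_plot_limits(values)
box_plot_outliers = [v for v in values if v < lower_limit or v > upper_limit]
```

Unlike the zscore approach, this rule does not rely on the mean or the standard deviation, so the limits themselves are less distorted by the outliers you are trying to detect.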

You have now learned about two different ways you can flag outliers on your data: zscore and box plot. You can decide whether you are going to remove these points from your dataset, transform them, or create another variable to specify that they exist (as shown in Table 4.11).
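The three options can be sketched as follows, given limits that you have already chosen (for example, from a zscore threshold or a box plot). The helper names are hypothetical, and clipping values to the limits is only one possible transformation, used here as an assumption.

```python
def remove_outliers(values, lower, upper):
    """Option 1: drop the points that fall outside the limits."""
    return [v for v in values if lower <= v <= upper]

def clip_outliers(values, lower, upper):
    """Option 2: transform the points, here by capping them at the limits."""
    return [min(max(v, lower), upper) for v in values]

def outlier_indicator(values, lower, upper):
    """Option 3: keep the data and add a variable that marks the outliers."""
    return [not (lower <= v <= upper) for v in values]

# Hypothetical values and limits.
values = [12, 15, 14, 90, 13, -40]
lower, upper = 0, 30

removed = remove_outliers(values, lower, upper)
clipped = clip_outliers(values, lower, upper)
indicator = outlier_indicator(values, lower, upper)
```

Removing loses rows, clipping changes values, and the indicator keeps everything but adds a column, so the right choice depends on your business context and on the algorithm you plan to use.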

Moving further on this journey of data preparation and transformation, you will face other types of problems in real life. Next, you will learn that several use cases contain something known as rare events, which makes ML algorithms focus on the wrong side of the problem and propose bad solutions. Luckily, you will learn how to either tune hyperparameters or prepare the data to facilitate algorithm convergence while fitting rare events.