Handling missing values – Data Preparation and Transformation – MLS-C01 Study Guide

Handling missing values

As the name suggests, missing values refer to the absence of data. Such absences are usually represented by tokens, which may or may not be implemented in a standard way.

Although using tokens is standard, the tokens themselves vary across platforms. For example, relational databases represent missing data with NULL, core Python uses None, and some Python libraries represent missing numbers as Not a Number (NaN).
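To make the three tokens concrete, here is a minimal sketch that uses only the Python standard library. The in-memory SQLite table and its column names are assumptions made up for illustration; the point is only that a SQL NULL arrives in Python as None, and that NaN is a special float that has to be detected with a dedicated function.

```python
import math
import sqlite3

# SQL NULL: a hypothetical in-memory table with one missing price.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")
conn.execute("INSERT INTO items VALUES ('pen', 1.5), ('book', NULL)")
rows = conn.execute("SELECT name, price FROM items ORDER BY name").fetchall()
conn.close()

# The sqlite3 driver hands NULL back to Python as None.
none_is_missing = rows[0][1] is None

# NaN is a float value that is not equal to anything, including itself,
# so an equality test cannot be used to find it.
nan = float("nan")
nan_equals_itself = nan == nan
nan_is_missing = math.isnan(nan)

print(rows)
print(none_is_missing, nan_equals_itself, nan_is_missing)
```

Because None and NaN behave differently, a check written for one of them will not necessarily catch the other.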

Important note

For numerical fields, don’t replace those standard missing tokens with zeros. Zero is not a missing value but a valid number, so it would be counted in statistics such as the mean.
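The effect of that replacement can be sketched with a tiny, made-up series of values. This assumes pandas is available; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical measurements with one missing entry (NaN).
values = pd.Series([10.0, 20.0, float("nan"), 30.0])

# pandas skips NaN by default, so the mean uses the three known values.
mean_ignoring_missing = values.mean()

# Filling with zero turns the missing entry into a real observation of 0.
mean_with_zero = values.fillna(0).mean()

print(mean_ignoring_missing, mean_with_zero)
```

The second mean is lower only because an invented zero was added to the data, not because anything was measured.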

However, in real business scenarios, you may or may not find those standard tokens. For example, a software engineering team might have designed the system to automatically fill missing data with specific tokens, such as “unknown” for strings or “-1” for numbers. In that case, you would have to search for those two tokens to find missing data. Any value can be chosen as such a token, so you have to find out which ones your source systems use.

In the previous example, the software engineering team was still kind enough to give you standard tokens. However, there are many cases where legacy systems do not add any data quality layer in front of the user, and you may find an address field filled with, “I don’t want to share,” or a phone number field filled with, “Don’t call me.” This is clearly missing data, but not as standard as the previous example.
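For the simpler case of known custom tokens such as “unknown” and “-1”, one possible approach is to convert them back to a standard missing token before any analysis. The following is a sketch that assumes pandas; the data frame, the column names, and the `sentinels` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical extract where the source system filled gaps with custom tokens.
df = pd.DataFrame({
    "city": ["Lisbon", "unknown", "Oslo", "unknown"],
    "age": [34, -1, 27, 41],
})

# Assumed mapping of column name to the tokens that mean "missing" there.
sentinels = {"city": ["unknown"], "age": [-1]}

clean = df.copy()
for col, tokens in sentinels.items():
    # mask() replaces the entries where the condition is True with NaN.
    clean[col] = clean[col].mask(clean[col].isin(tokens))

missing_before = df.isna().sum()
missing_after = clean.isna().sum()
print(missing_before)
print(missing_after)
```

Keeping the mapping per column matters here: -1 may be a legitimate value in another numeric field, so a global replacement could erase real data.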

There are many more nuances regarding missing data, all of which you will learn in this section, but be advised: before you start making decisions about missing values, you should perform a thorough data exploration and make sure you find those values. You can either compute data frequencies or use missing plots, but please do something. Never assume that your missing data is represented only by those handy standard tokens.
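As a rough illustration of the frequency approach, the sketch below counts the values of a made-up phone field and flags entries that do not match an expected pattern. It assumes pandas; the sample values and the `\d{3}-\d{4}` pattern are assumptions chosen for the example, not a real validation rule.

```python
import pandas as pd

# Hypothetical phone field from a system with no data quality layer.
phone = pd.Series(
    ["555-0101", "Don't call me", None, "555-0199", "Don't call me", "n/a"]
)

# Frequencies, including the standard missing token, expose repeated junk values.
freq = phone.value_counts(dropna=False)

# Entries that do not look like a phone number are candidates for "missing".
looks_valid = phone.str.fullmatch(r"\d{3}-\d{4}", na=False)
suspicious = phone[~looks_valid]

standard_missing = int(phone.isna().sum())
print(freq)
print(standard_missing, len(suspicious))
```

In this toy data, checking only for the standard token finds one missing value, while the pattern check surfaces four entries worth reviewing.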

Why should you care about this type of data? Well, first, because most algorithms (apart from decision trees as implemented in very specific ML libraries) will raise errors when they find a missing value. Second (and maybe more important), by grouping all the missing data in the same bucket, you are assuming that they are all the same, but in reality, you don’t know that.

Such a decision will not only add bias to your model – it will reduce its interpretability, as you will be unable to explain the missing data. Once you know why you want to treat the missing values, then you can analyze your options.

Theoretically, you can classify missing values into two main groups: MCAR or MNAR. MCAR stands for Missing Completely at Random and states that there is no pattern associated with the missing data. On the other hand, MNAR stands for Missing Not at Random and means that the underlying process used to generate the data is strictly connected to the missing values. (The statistical literature also defines an intermediate class, Missing at Random, or MAR, in which the missingness depends only on other observed fields.)

Look at the following example of missing values that are not random. Suppose you are collecting user feedback about a particular product in an online survey. Your process of asking questions is dynamic and depends on user answers. When a user specifies an age lower than 18 years old, you never ask for their marital status. In this case, missing values of marital status are connected to the age of the user (MNAR in this two-group view; under the stricter three-class taxonomy it would be labeled MAR, because an observed field, age, fully explains the missingness).
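One simple way to look for this kind of pattern is to compare the rate of missing values across groups of another field. The sketch below assumes pandas and uses an invented survey table; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical survey answers; marital status is never asked of minors.
survey = pd.DataFrame({
    "age": [15, 17, 22, 35, 16, 41],
    "marital_status": [None, None, "single", "married", None, None],
})

# Label each respondent as "minor" or "adult" based on the age answer.
group = (survey["age"] < 18).map({True: "minor", False: "adult"})

# Share of missing marital_status values within each group.
missing_rate = survey["marital_status"].isna().groupby(group).mean()
print(missing_rate)
```

A missing rate of 100% for one group and a much lower rate for the other suggests that the gaps follow a rule in the collection process rather than chance.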

Knowing the class of missing values that you are dealing with will help you understand whether you have any control over the underlying process that generates the data. Sometimes, you can go back to the source process and complete your missing data.