
Important note

Before feeding any ML algorithm with data, make sure your feature types have been properly identified.

In theory, if you are happy with your features and have properly classified each of them, you should be ready to go into the modeling phase of the CRISP-DM methodology, shouldn’t you? Well, maybe not. There are many reasons you may want to spend a little more time on data preparation, even after you have correctly classified your features:

  • Some ML libraries, such as scikit-learn, may not accept string values for your categorical features.
  • The data distribution of your variable may not be the best one for your algorithm.
  • Your ML algorithm may be impacted by the scale of your data.
  • Some observations of your variable may be missing information that you will have to fix; these are known as missing values.
  • You may find outlier values of your variable that can potentially add bias to your model.
  • Your variable may be storing different types of information and you may only be interested in a few of them (for example, a date variable can store the day of the week or the week of the month).
  • You might want to find a mathematical representation for a text variable.
  • Believe me, this list has no real end!

In the following sections, you will understand how to address all these points, starting with categorical features.

Dealing with categorical features

Data transformation methods for categorical features will vary according to the sub-type of your variable. In the upcoming sections, you will understand how to transform nominal and ordinal features.

Transforming nominal features

You may have to create numerical representations of your categorical features before applying ML algorithms to them. Some libraries may have embedded logic to handle that transformation for you, but most of them do not.

The first transformation you will learn is known as label encoding. A label encoder is suitable for categorical/nominal variables, and it simply associates a number with each distinct label of your variable. Table 4.2 shows how a label encoder works:

Country      Label encoding
India        1
Canada       2
Brazil       3
Australia    4
India        1

Table 4.2 – Label encoder in action

A label encoder will always ensure that a unique number is associated with each distinct label. In the preceding table, although “India” appears twice, the same number was assigned to it.
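
As a quick illustration, here is a minimal sketch of label encoding using scikit-learn's LabelEncoder (the country values mirror Table 4.2; note that scikit-learn assigns its codes alphabetically starting from 0, so the exact numbers differ from the table, but repeated labels still receive the same code):

```python
# Minimal sketch of label encoding with scikit-learn.
from sklearn.preprocessing import LabelEncoder

countries = ["India", "Canada", "Brazil", "Australia", "India"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)

print(list(encoder.classes_))  # distinct labels: ['Australia', 'Brazil', 'Canada', 'India']
print(list(encoded))           # [3, 2, 1, 0, 3] - the repeated label gets the same code
```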

You now have a numerical representation of each country, but this does not mean you can use that numerical representation in your models! In this particular case, you are transforming a nominal feature, which does not have an order.

According to Table 4.2, if you pass the encoded version of the country variable to a model, it will make assumptions such as “Brazil (3) is greater than Canada (2),” which does not make any sense.

One possible solution for that scenario is applying another type of transformation to “country”: one-hot encoding. This transformation represents each category from the original feature as an individual feature (also known as a dummy variable) that stores the presence or absence of that category. Table 4.3 transforms the same information as Table 4.2, but this time applies one-hot encoding:

Country      India   Canada   Brazil   Australia
India        1       0        0        0
Canada       0       1        0        0
Brazil       0       0        1        0
Australia    0       0        0        1
India        1       0        0        0

Table 4.3 – One-hot encoding in action
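
As a minimal sketch, assuming pandas is available (the column name country is only illustrative), the result in Table 4.3 could be reproduced with pd.get_dummies:

```python
# Minimal sketch of one-hot encoding with pandas.
import pandas as pd

df = pd.DataFrame({"country": ["India", "Canada", "Brazil", "Australia", "India"]})

# Each distinct country becomes a dummy column holding 1 (present) or 0 (absent).
# The dummy columns appear in alphabetical order rather than the order in Table 4.3.
dummies = pd.get_dummies(df["country"], dtype=int)
print(dummies)
```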

You can now use the one-hot encoded version of the country variable as a feature of an ML model. However, your work as a skeptical data scientist is never done, and your critical thinking ability will be tested in the AWS Machine Learning Specialty exam.

Suppose you have 150 distinct countries in your dataset. How many dummy variables would you come up with? 150, right? Here, you have just run into a potential issue: apart from adding complexity to your model (which is never desirable), dummy variables also add sparsity to your data.

A sparse dataset has a lot of variables filled with zeros. Often, it is hard to fit this type of data structure into memory (you can easily run out of memory), and it is very time-consuming for ML algorithms to process sparse structures.

You can work around the sparsity problem by grouping your original data to reduce the number of categories, and you can even use libraries that compress your sparse data and make it easier to manipulate (such as the scipy.sparse.csr_matrix class from Python’s SciPy library).
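
For instance, here is a minimal sketch of compressing the one-hot matrix from Table 4.3 into compressed sparse row (CSR) format with SciPy:

```python
# Minimal sketch: storing a mostly-zero one-hot matrix in CSR format.
import numpy as np
from scipy.sparse import csr_matrix

# Dense one-hot matrix from Table 4.3 (5 rows x 4 dummy columns).
dense = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])

sparse = csr_matrix(dense)  # only non-zero entries (and their positions) are stored
print(sparse.nnz)           # 5 stored values instead of 20 cells
```

scikit-learn’s OneHotEncoder, for example, returns a SciPy sparse matrix by default for this very reason.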

Therefore, during the exam, remember that one-hot encoding is definitely the right way to go when you need to transform categorical/nominal data before feeding it to ML models; however, take the number of unique categories in your original feature into account and consider whether it makes sense to create dummy variables for all of them (it might not if the number of unique categories is very large).