Data Preparation and Transformation – MLS-C01 Study Guide

You have probably heard that data scientists spend most of their time on data-preparation activities. It is now time to explain why that happens and what those activities involve.

In this chapter, you will learn how to deal with categorical and numerical features, as well as how to apply different techniques to transform your data, such as one-hot encoding, binary encoding, ordinal encoding, binning, and text transformations. You will also learn how to handle missing values and outliers, two important tasks for building good machine learning (ML) models.

In this chapter, you will cover the following topics:

  • Identifying types of features
  • Dealing with categorical features
  • Dealing with numerical features
  • Understanding data distributions
  • Handling missing values
  • Dealing with outliers
  • Dealing with unbalanced datasets
  • Dealing with text data

This chapter is a little longer than the others and will require more patience. Knowing about these topics in detail will put you in a good position for the AWS Machine Learning Specialty exam.

Identifying types of features

You cannot start modeling without knowing what a feature is and what type of information it can store. You have already read about the different processes that deal with features. For example, you know that feature engineering is related to the task of building and preparing features for your models; you also know that feature selection is related to the task of choosing the best set of features to feed a particular algorithm. These two tasks have one thing in common: the techniques they apply vary according to the type of feature being processed.

It is very important to understand this behavior (feature type versus applicable transformations) because it will help you eliminate invalid answers during your exam (and, most importantly, you will become a better data scientist).

The type of a feature refers to the kind of data that a particular feature is supposed to store. Figure 4.1 shows how you could potentially describe the different types of features of a model.

Figure 4.1 – Feature types

In Chapter 1, Machine Learning Fundamentals, you were introduced to the feature classification shown in Figure 4.1. Now, look at some real examples, so that you can clear up any remaining questions you may have:

| Feature type | Feature sub-type | Definition | Example |
| Categorical | Nominal | Labelled variables with no quantitative value | Cloud provider: AWS, MS, Google |
| Categorical | Ordinal | Adds a sense of order to the labelled variable | Job title: junior data scientist, senior data scientist, chief data scientist |
| Categorical | Binary | A variable with only two allowed values | Fraud classification: fraud, not fraud |
| Numerical | Discrete | Individual and countable items | Number of students: 100 |
| Numerical | Continuous | An infinite number of possible measurements, often carrying decimal points | Total amount: $150.35 |

Table 4.1 – Real examples of feature values
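To make these sub-types concrete, here is a minimal sketch (the column names and values are illustrative, taken from Table 4.1 rather than from any real dataset) of how they could be represented in pandas, which lets you declare ordinal features explicitly:

```python
import pandas as pd

# Illustrative columns covering each feature sub-type from Table 4.1
df = pd.DataFrame({
    "cloud_provider": ["AWS", "MS", "Google"],   # categorical: nominal
    "job_title": ["junior", "chief", "senior"],  # categorical: ordinal
    "is_fraud": [False, True, False],            # categorical: binary
    "num_students": [100, 250, 80],              # numerical: discrete
    "total_amount": [150.35, 99.90, 10.00],      # numerical: continuous
})

# Nominal features can be stored as unordered categoricals...
df["cloud_provider"] = df["cloud_provider"].astype("category")

# ...while ordinal features carry an explicit category order
df["job_title"] = pd.Categorical(
    df["job_title"],
    categories=["junior", "senior", "chief"],
    ordered=True,
)

print(df.dtypes)
```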

Although looking at the values of the variable may help you find its type, you should never rely only on this approach. The nature of the variable is also very important for making such decisions. For example, someone could encode the cloud provider variable shown in Table 4.1 as follows: 1 (AWS), 2 (MS), 3 (Google). In that case, the variable is still a nominal feature, even if it is now represented by discrete numbers.

If you are building an ML model and you don't tell your algorithm that this variable is a nominal feature rather than a discrete number, the algorithm will treat it as a quantity, assuming an order and magnitude (for example, that Google > MS > AWS) that do not exist, and your model will no longer be interpretable.
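As a short sketch of this point (the integer mapping below is hypothetical, continuing the cloud provider example; one-hot encoding itself is covered later in this chapter), you could restore the nominal nature of such a variable before modeling:

```python
import pandas as pd

# Hypothetical integer encoding of a nominal feature: 1=AWS, 2=MS, 3=Google
df = pd.DataFrame({"cloud_provider": [1, 2, 3, 1]})

# Left as-is, most algorithms would assume 3 (Google) > 2 (MS) > 1 (AWS).
# Mapping the codes back to labels and one-hot encoding removes that
# false ordering:
mapping = {1: "AWS", 2: "MS", 3: "Google"}
df["cloud_provider"] = df["cloud_provider"].map(mapping)
one_hot = pd.get_dummies(df, columns=["cloud_provider"])
print(one_hot)
```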