You have probably heard that data scientists spend most of their time working on data-preparation-related activities. It is now time to explain why that happens and what types of activities they work on.
In this chapter, you will learn how to deal with categorical and numerical features, as well as how to apply different techniques to transform your data, such as one-hot encoding, binary encoding, ordinal encoding, binning, and text transformations. You will also learn how to handle missing values and outliers in your data, which are two important tasks you must master to build good machine learning (ML) models.
In this chapter, you will cover the following topics:
This chapter is a little longer than the others and will require more patience. Knowing about these topics in detail will put you in a good position for the AWS Machine Learning Specialty exam.
You cannot start modeling without knowing what a feature is and what type of information it can store. You have already read about the different processes that deal with features. For example, you know that feature engineering is related to the task of building and preparing features for your models; you also know that feature selection is related to the task of choosing the best set of features to feed a particular algorithm. These two tasks have one behavior in common: they may vary according to the types of features they are processing.
It is very important to understand this behavior (feature type versus applicable transformations) because it will help you eliminate invalid answers during your exam (and, most importantly, you will become a better data scientist).
The type of a feature refers to the kind of data that the feature is supposed to store. Figure 4.1 shows how you could potentially describe the different types of features of a model.
Figure 4.1 – Feature types
In Chapter 1, Machine Learning Fundamentals, you were introduced to the feature classification shown in Figure 4.1. Now, look at some real examples, so that you can clear up any remaining doubts you may have:
| Feature type | Feature sub-type | Definition | Example |
| --- | --- | --- | --- |
| Categorical | Nominal | Labelled variables with no quantitative value | Cloud provider: AWS, MS, Google |
| Categorical | Ordinal | Adds a sense of order to the labelled variable | Job title: junior data scientist, senior data scientist, chief data scientist |
| Categorical | Binary | A variable with only two allowed values | Fraud classification: fraud, not fraud |
| Numerical | Discrete | Individual and countable items | Number of students: 100 |
| Numerical | Continuous | Measurements that can take infinitely many values and often carry decimal points | Total amount: $150.35 |
Table 4.1 – Real examples of feature values
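To make the ordinal sub-type concrete, the following minimal sketch encodes the ordered job-title labels from Table 4.1 as integers that preserve their ranking. The specific mapping (junior = 0, senior = 1, chief = 2) is an illustrative assumption, not a fixed convention:

```python
# Illustrative ordinal encoding for the job-title labels in Table 4.1.
# The integer codes are assumed for this example; what matters is that
# their numeric order matches the seniority order of the labels.
JOB_LEVELS = ["junior data scientist", "senior data scientist", "chief data scientist"]
LEVEL_CODES = {label: code for code, label in enumerate(JOB_LEVELS)}

def encode_ordinal(values):
    """Map ordered labels to integers that preserve their ranking."""
    return [LEVEL_CODES[v] for v in values]

titles = ["senior data scientist", "junior data scientist", "chief data scientist"]
codes = encode_ordinal(titles)
print(codes)  # [1, 0, 2] – numeric order reflects seniority order
```

Because the codes carry a meaningful order, comparisons such as "code 2 is more senior than code 0" are valid here, which is exactly what distinguishes an ordinal variable from a nominal one.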
Although looking at the values of the variable may help you find its type, you should never rely only on this approach. The nature of the variable is also very important for making such decisions. For example, someone could encode the cloud provider variable shown in Table 4.1 as follows: 1 (AWS), 2 (MS), 3 (Google). In that case, the variable is still a nominal feature, even if it is now represented by discrete numbers.
If you are building an ML model and you don't tell your algorithm that this variable is a nominal feature rather than a discrete number, the algorithm will treat it as a number and assume an order among the categories that does not exist (AWS < MS < Google). The resulting model may be both less accurate and harder to interpret.
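One common way to avoid that problem is one-hot encoding, which represents each category as its own 0/1 indicator so that no artificial order is implied. The sketch below applies it to the nominal cloud-provider variable from Table 4.1; the fixed category vocabulary is an assumption for this example:

```python
# Illustrative one-hot encoding of the nominal "cloud provider" feature.
# Each value becomes a vector with a single 1 in its category's slot,
# so no ordering such as 1 < 2 < 3 is imposed on the categories.
PROVIDERS = ["AWS", "MS", "Google"]  # assumed, fixed category vocabulary

def one_hot(value, categories=PROVIDERS):
    """Return a 0/1 vector with a 1 only at the value's category slot."""
    return [1 if value == c else 0 for c in categories]

rows = ["AWS", "Google", "AWS"]
encoded = [one_hot(r) for r in rows]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

In practice, you would typically use a library implementation (for example, scikit-learn's `OneHotEncoder` or pandas' `get_dummies`) rather than hand-rolling this, but the principle is the same: one binary column per category.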