For categorical variables with a large number of unique categories, a potential approach to creating a numerical representation is binary encoding. The goal of this approach is to transform the categorical variable into multiple binary columns while minimizing the number of new columns.
This process consists of three basic steps:
1. Encode each distinct category as an integer, just as a label encoder does.
2. Convert each integer into its binary representation.
3. Split the binary digits into separate columns, one column per digit.
Table 4.4 shows how to convert the data from Table 4.2 into binary-encoded columns.
| Country | Label encoder | Binary | Col1 | Col2 | Col3 |
| --- | --- | --- | --- | --- | --- |
| India | 1 | 001 | 0 | 0 | 1 |
| Canada | 2 | 010 | 0 | 1 | 0 |
| Brazil | 3 | 011 | 0 | 1 | 1 |
| Australia | 4 | 100 | 1 | 0 | 0 |
| India | 1 | 001 | 0 | 0 | 1 |

Table 4.4 – Binary encoding in action
As you can see, there are now three columns (Col1, Col2, and Col3) instead of the four columns produced by the one-hot encoding transformation in Table 4.3.
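To make these three steps concrete, here is a minimal sketch using pandas (the DataFrame and column names are illustrative); in practice, third-party libraries such as category_encoders provide a ready-made BinaryEncoder:

```python
import pandas as pd

df = pd.DataFrame({"country": ["India", "Canada", "Brazil", "Australia", "India"]})

# Step 1: label-encode the categories (order of first appearance, 1-based
# so that the codes match the "Label encoder" column in Table 4.4)
codes = pd.factorize(df["country"])[0] + 1

# Step 2: convert each integer code into a fixed-width binary string
n_bits = int(codes.max()).bit_length()
binary = [format(int(code), f"0{n_bits}b") for code in codes]

# Step 3: split the binary digits into separate columns (Col1..Col3)
for i in range(n_bits):
    df[f"Col{i + 1}"] = [int(b[i]) for b in binary]

print(df)
```

Running this reproduces Table 4.4: four distinct countries fit into three binary columns, and the gap widens quickly as the number of categories grows.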
Ordinal features have a very specific characteristic: they have a natural order. Because of this quality, it does not make sense to apply one-hot encoding to them; if you do, the algorithm used to train your model will not be able to recover the implicit order of the values of this feature.
The most common transformation for this type of variable is known as ordinal encoding. An ordinal encoder will associate a number with each distinct label of your variable, just like a label encoder does, but this time, it will respect the order of each category. The following table shows how an ordinal encoder works:
| Education | Ordinal encoding |
| --- | --- |
| Trainee | 1 |
| Junior data analyst | 2 |
| Senior data analyst | 3 |
| Chief data scientist | 4 |

Table 4.5 – Ordinal encoding in action
You can now pass the encoded variable to ML models and they will be able to handle this variable properly, with no need to apply one-hot encoding transformations. This time, comparisons such as “senior data analyst is greater than junior data analyst” make total sense.
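As a minimal sketch, this is how you could reproduce Table 4.5 with scikit-learn's OrdinalEncoder; the category order must be passed explicitly (otherwise, the encoder sorts categories alphabetically), and scikit-learn encodes from 0, so the sketch adds 1 to match the table:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Supply the order explicitly, from lowest to highest rank (see Table 4.5)
education_order = [["Trainee", "Junior data analyst",
                    "Senior data analyst", "Chief data scientist"]]
encoder = OrdinalEncoder(categories=education_order)

X = np.array([["Junior data analyst"], ["Trainee"], ["Chief data scientist"]])

# scikit-learn encodes from 0, so add 1 to match Table 4.5
print(encoder.fit_transform(X) + 1)  # [[2.], [1.], [4.]]
```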
Do not forget the following statement: encoders are fitted on the training data and then used to transform the test and production data. This is how your ML pipeline should work.
Suppose you have created a one-hot encoder that fits the data from Table 4.2 and returns data according to Table 4.3. In this example, assume this is your training data. Once you have completed your training process, you may want to apply the same one-hot encoding transformation to your testing data to check the model’s results.
In the scenario that was just described (a very common situation in modeling pipelines), you cannot retrain your encoder on the testing data! You should reuse the encoder object that you have already fitted on the training data. Technically, you should not call the fit method again; you should call the transform method instead.
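Here is a minimal sketch of that pattern, assuming scikit-learn 1.2+ (where the dense-output parameter of OneHotEncoder is named sparse_output); the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["India"], ["Canada"], ["Brazil"], ["Australia"], ["India"]])
X_test = np.array([["Canada"], ["India"]])

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X_train)                         # fit on the training data only

X_test_encoded = encoder.transform(X_test)   # reuse: transform, never refit
```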
You may already know the reasons why you should follow this rule, but just as a reminder: the testing data was created to extract the performance metrics of your model, so you should not use it to extract any other knowledge. If you do so, your performance metrics will be biased by the testing data, and you cannot infer that the same performance (shown in the test data) is likely to happen in production (when new data will come in).
Alright, all good so far. However, what if your testing set has a new category that was not present in the training set? How are you supposed to transform this data?
Going back to the one-hot encoding example in Table 4.2 (input data) and Table 4.3 (output data), this encoder knows how to transform the following countries: Australia, Brazil, Canada, and India. If you had a different country in the testing set, the encoder would not know how to transform it; that is why you need to define how it should behave when it encounters such exceptions.
Most ML libraries provide specific parameters for these situations. In the previous example, you could configure the encoder to either raise an error or set all zeros for the dummy variables, as shown in Table 4.6.
| Country | India | Canada | Brazil | Australia |
| --- | --- | --- | --- | --- |
| India | 1 | 0 | 0 | 0 |
| Canada | 0 | 1 | 0 | 0 |
| Brazil | 0 | 0 | 1 | 0 |
| Australia | 0 | 0 | 0 | 1 |
| India | 1 | 0 | 0 | 0 |
| Portugal | 0 | 0 | 0 | 0 |

Table 4.6 – Handling unknown values on one-hot encoding transformations
As you can see, Portugal was not present in the training set (Table 4.2), so during the transformation, the encoder keeps the same list of known countries and indicates that Portugal is not among them (all zeros).
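A minimal sketch of this behavior with scikit-learn's OneHotEncoder (again assuming version 1.2+ for the sparse_output parameter) could look as follows:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["India"], ["Canada"], ["Brazil"], ["Australia"], ["India"]])
test = np.array([["India"], ["Portugal"]])

# handle_unknown="ignore" encodes unseen categories as all zeros;
# the default, handle_unknown="error", raises an exception instead
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(train)

print(encoder.transform(test))  # the "Portugal" row is all zeros
```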
As the very good skeptical data scientist you are becoming, should you be concerned about the fact that you have a particular category that has not been used during training? Well, maybe. This type of analysis really depends on your problem domain.
Handling unknown values is very common and something you should expect to do in your ML pipeline. However, since that particular category was never seen during your training process, you should also ask yourself whether your model can extrapolate and generalize to it.
Remember, your testing data must follow the same distribution as your training data, so you are very likely to find all (or at least most) of the categories of a categorical feature in both the training and testing sets. Furthermore, if you are facing overfitting issues (performing well on the training set but poorly on the testing set) and, at the same time, you realize that your categorical encoders are transforming a lot of unknown values in the test set, guess what? Your training and testing samples are likely not following the same distribution, which invalidates your model entirely.
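One quick sanity check along these lines is to measure the share of test rows whose category was never seen during training; a high rate is a red flag for a distribution mismatch. A minimal sketch, assuming two pandas DataFrames with an illustrative country column:

```python
import pandas as pd

train_df = pd.DataFrame({"country": ["India", "Canada", "Brazil", "Australia", "India"]})
test_df = pd.DataFrame({"country": ["India", "Portugal"]})

# Fraction of test rows whose country never appeared in the training set;
# a rate well above zero suggests the two samples follow different distributions
known_countries = set(train_df["country"])
unknown_rate = (~test_df["country"].isin(known_countries)).mean()
print(f"Unknown-category rate in the test set: {unknown_rate:.1%}")
```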
As you can see, you are slowly getting there. You are learning about bias and investigation strategies in fine-grained detail – that is so exciting! Now, move on and look at how to perform transformations on numerical features. Yes, each type of data matters and drives your decisions.