
Dealing with unbalanced datasets

At this point, you might have realized why data preparation is probably the longest part of the data scientist’s work. You have learned about data transformation, missing data values, and outliers, but the list of problems goes on. Don’t worry – you are on the right journey to master this topic!

Another well-known problem with ML models, specifically with classification problems, is unbalanced classes. In a classification problem, a dataset is said to be unbalanced when the vast majority of its observations belong to one (or a few) of the classes of the target variable.

This is very common in fraud identification systems, for example, where the vast majority of events correspond to regular operations and only a very small number correspond to fraudulent operations. In this case, you can also say that fraud is a rare event.

There is no strict rule for defining whether a dataset is unbalanced or not; it really depends on the context of your business domain. That said, in many challenging problems, more than 99% of the observations belong to the majority class.

The problem with unbalanced datasets is very simple: ML algorithms will try to find the best fit on the training data to maximize their accuracy. In a dataset where 99% of the cases belong to one single class, without any tuning, the algorithm is likely to prioritize getting the majority class right. In the worst-case scenario, it will classify all observations as the majority class and ignore the minority one, which is usually the class of interest when modeling rare events.
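The short sketch below illustrates this accuracy trap on a synthetic dataset (all names and numbers here are illustrative, not from the study guide): a baseline model that always predicts the majority class scores roughly 99% accuracy while missing every rare-class observation.

```python
# Illustrative only: a "majority class" baseline on a synthetic unbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic data: ~99% of rows in class 0, ~1% in class 1 (the "rare event")
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# A classifier that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

# Accuracy looks excellent (~0.99), but recall on the rare class is 0.0:
# every minority-class observation is misclassified.
print(accuracy_score(y, pred), recall_score(y, pred))
```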

To deal with unbalanced datasets, you have two major directions to follow: tuning the algorithm to handle the issue or resampling the data to make it more balanced.

When tuning the algorithm, you specify the weight of each class under classification. This class weight configuration belongs to the algorithm, not to the training data, so it is a hyperparameter setting. Keep in mind that not all algorithms offer this type of configuration, and not all ML frameworks expose it, either. As a quick reference, the DecisionTreeClassifier class from the scikit-learn library is a good example of one that does implement the class weight hyperparameter.
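As a minimal sketch of that idea, the snippet below sets the class_weight hyperparameter on scikit-learn's DecisionTreeClassifier; the synthetic dataset is an assumption added purely for illustration:

```python
# A minimal sketch of class weighting with scikit-learn's DecisionTreeClassifier.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic unbalanced dataset: ~95% class 0, ~5% class 1 (illustrative only)
X, y = make_classification(
    n_samples=1_000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# class_weight="balanced" re-weights each class inversely to its frequency,
# so mistakes on the minority class cost more during training.
clf = DecisionTreeClassifier(class_weight="balanced", random_state=42)
clf.fit(X, y)

# You could also pass explicit weights, for example class_weight={0: 1, 1: 20}.
```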

Another way to work around unbalanced problems is to change the training dataset by applying undersampling or oversampling. If you decide to apply undersampling, all you have to do is remove observations from the majority class until you get a more balanced dataset. Of course, the downside of this approach is that you may lose important information about the majority class you are removing observations from.

The most common approach for undersampling is known as random undersampling, a naïve resampling approach in which you randomly remove observations of the majority class from the training set.
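Here is a minimal sketch of random undersampling written with NumPy, assuming binary labels stored in NumPy arrays with 1 as the rare class; libraries such as imbalanced-learn package the same idea as RandomUnderSampler:

```python
# A minimal sketch of naive random undersampling (assumes NumPy arrays and
# binary labels, with 1 marking the minority class).
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(X, y, minority_label=1):
    """Randomly drop majority-class rows until both classes have the same size."""
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    # Keep only as many majority observations as there are minority ones
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    rng.shuffle(keep)
    return X[keep], y[keep]
```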

On the other hand, you can decide to go for oversampling, where you create new observations/samples of the minority class. The simplest approach is the naïve one, where you randomly select observations of the minority class from the training set (with replacement) and duplicate them. The downside of this method is the potential for overfitting, since you will be duplicating, and therefore emphasizing, the observed pattern of the minority class.
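A minimal sketch of naïve random oversampling is shown below, using scikit-learn's resample utility to duplicate minority-class rows with replacement; the helper function name and binary-label assumption are illustrative, and imbalanced-learn's RandomOverSampler implements the same idea:

```python
# A minimal sketch of naive random oversampling (assumes NumPy arrays and
# binary labels, with 1 marking the minority class).
import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, minority_label=1, random_state=42):
    """Duplicate random minority-class rows until both classes have the same size."""
    minority_mask = (y == minority_label)
    X_min, y_min = X[minority_mask], y[minority_mask]
    X_maj, y_maj = X[~minority_mask], y[~minority_mask]
    # Sample the minority class with replacement up to the majority-class size
    X_min_up, y_min_up = resample(
        X_min, y_min, replace=True, n_samples=len(y_maj), random_state=random_state
    )
    return np.vstack([X_maj, X_min_up]), np.concatenate([y_maj, y_min_up])
```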

To check whether you have underfitted or overfitted your model, you should always evaluate the fitted model on your testing set.