How To Handle Imbalanced Data In Machine Learning Classification

January 18, 2022


In this tutorial, you’ll learn about imbalanced data and how to handle it in machine learning classification in Python.

Imbalanced data occurs when the classes of the dataset are distributed unequally. It is common in machine learning classification problems.

An extreme example is a dataset where 99.9% of the observations are class A (the majority class) and only 0.1% are class B (the minority class). If you feed such data directly into a machine learning algorithm, you may find that the model ‘ignores’ the minority class and predicts it poorly. This is frustrating since the goal is often to predict the minority class.

So, it is critical to understand and handle the imbalance problem. Throughout this practical tutorial, you’ll use a highly imbalanced dataset as an example, and learn:

What is imbalanced data in machine learning classification
How to train and evaluate prediction results
How to deal with it using 6 techniques:
- Collecting a bigger sample
- Oversampling (e.g., random, SMOTE)
- Undersampling (e.g., random, K-Means, Tomek links)
- Combining over and undersampling
- Weighing classes differently
- Changing algorithms
Lots more.
All in Python!

In the end, you should be ready to make better predictions based on your imbalanced data.

Let’s jump in!

What is imbalanced data in machine learning?

Given a dataset with known labels/classes, we can build a model to predict the class a new observation belongs to. This is the machine learning classification problem. Within it, we have imbalanced data when the number of observations per class is far from equal.

For example, a dataset of credit card transactions could contain 99.9% legitimate transactions and only 0.1% fraudulent ones. This is a highly imbalanced dataset.

So what is the problem with imbalanced data in machine learning?

While a slight imbalance wouldn’t be a problem, a highly imbalanced dataset could cause issues for our classification predictions. This is because most machine learning algorithms need enough examples of each class to learn its patterns. When a class has little data, the algorithm can’t learn to predict it correctly.

Back to the credit card fraud detection example. Since the fraudulent transactions are underrepresented, a machine learning algorithm often predicts that class poorly. This is problematic since detecting fraudulent transactions is exactly what we want the model to do.

Besides credit card fraud detection, other fields also tend to have highly imbalanced datasets. For example:

Claim prediction/fraud detection in insurance companies
Spam detection
Customer churn/conversion prediction

Since machine learning classification can be binary (2-class) or multi-class, the imbalanced data problem can arise in both. This tutorial focuses on imbalanced data in binary classification, but you can extend the concepts to multi-class problems.

Evaluation metrics: accuracy pitfall

Before diving into our example, let’s discuss the evaluation metrics. This is a critical choice for an imbalanced dataset.

For classification problems, we often use accuracy as the evaluation metric. It is easy to calculate and intuitive:

Accuracy = # of correct predictions / # of total predictions

But it is misleading for highly imbalanced datasets. In the credit card fraud detection example, a model that always classifies new transactions as legitimate would reach 99.9% accuracy if 99.9% of the transactions in the dataset are legitimate.

What an ‘accurate’ model!

But, don’t forget that our goal is to detect fraud, so such a model is useless.
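The pitfall can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels (999 legitimate transactions and 1 fraud, chosen only to match the 99.9% / 0.1% split above), not the tutorial's dataset. The always-legitimate "model" and the variable names are illustrative assumptions.

```python
# Hypothetical labels: 0 = legitimate, 1 = fraud (999 legitimate, 1 fraud).
y_true = [0] * 999 + [1]

# A "model" that always predicts the majority class (legitimate).
y_pred = [0] * len(y_true)

# Accuracy = # of correct predictions / # of total predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class = detected frauds / actual frauds
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = detected / sum(y_true)

print(f"accuracy: {accuracy:.3f}")    # accuracy: 0.999
print(f"fraud recall: {recall:.3f}")  # fraud recall: 0.000
```

The accuracy looks excellent, yet the model never catches a single fraud, which is why a different metric is needed.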

So for an imbalanced dataset, we must look at a broader picture of the prediction results. We could use other evaluation metrics such as the Area Under the ROC Curve (AUC), the F-score, or the Precision-Recall Curve.

In this tutorial, we’ll use AUC as the evaluation metric.

It’s a single metric that’s easy to use. AUC reaches its highest value of 1 when the classifier ranks every positive example above every negative one, and it is around 0.5 for random guessing.

We’ll calculate the AUC for models trained on the original imbalanced dataset and on the rebalanced datasets, so you can compare them and get an idea of the potential improvement from applying the imbalanced data techniques. Please note that the improvement varies across datasets and machine learning algorithms.
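To build some intuition for AUC, here is a minimal sketch that computes it from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `auc_score` and the toy scores are assumptions made for illustration. In practice you would more likely call a library function such as scikit-learn's `roc_auc_score`.

```python
def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical): 4 negatives, 2 positives.
y_true = [0, 0, 0, 0, 1, 1]
perfect = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]   # every positive outranks every negative
constant = [0.5] * 6                       # same score for everyone
mixed = [0.1, 0.6, 0.3, 0.4, 0.8, 0.5]     # one negative outranks one positive

print(auc_score(y_true, perfect))   # 1.0
print(auc_score(y_true, constant))  # 0.5
print(auc_score(y_true, mixed))     # 0.875
```

Note that the constant scorer, which is the ranking equivalent of "always predict legitimate", gets only 0.5 here, while its accuracy on an imbalanced dataset would look high.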

Now, let’s get to our example of imbalanced data.

Example of an imbalanced dataset

In this section, we’ll look at our example of an imbalanced dataset. We’ll quickly get it ready for applying the imbalanced data techniques.

The dataset is about abalone. If you’ve never heard of abalone, they are marine snails. Our goal is to identify whether an abalone belongs to one specific class, class 19. So this is a binary classification problem with two outcomes: positive (class 19) or negative.

You can download the data here. It’s a small and straightforward dataset.

Loading data

First, let’s load and look at the dataset in Python.

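import pandas as pd

# Load the abalone19 dataset
df = pd.read_csv('abalone19.dat')

# Inspect column types and non-null counts
df.info()

# Check the distribution of the categorical feature and of the target
df['Sex'].value_counts()

df['Class'].value_counts()

# Encode the target as 0 (negative) / 1 (positive), then one-hot encode Sex
df['Class'] = df['Class'].map(lambda x: 0 if x == 'negative' else 1)
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df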

      Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  Shell_weight  Class  Sex_I  Sex_M
0      0.455     0.365   0.095        0.5140          0.2245          0.1010        0.1500      0      0      1
1      0.350     0.265   0.090        0.2255          0.0995          0.0485        0.0700      0      0      1
2      0.530     0.420   0.135        0.6770          0.2565          0.1415        0.2100      0      0      0
3      0.440     0.365   0.125        0.5160          0.2155          0.1140        0.1550      0      0      1
4      0.330     0.255   0.080        0.2050          0.0895          0.0395        0.0550      0      1      0
…          …         …       …             …               …               …             …      …      …      …
4169   0.560     0.430   0.155        0.8675          0.4000          0.1720        0.2290      0      0      1
4170   0.565     0.450   0.165        0.8870          0.3700          0.2390        0.2490      0      0      0
4171   0.590     0.440   0.135        0.9660          0.4390          0.2145        0.2605      0      0      1
4172   0.600     0.475   0.205        1.1760          0.5255          0.2875        0.3080      0      0      1
4173   0.625     0.485   0.150        1.0945          0.5310          0.2610        0.2960      0      0      0
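To see what the encoding step does in isolation, here is a small sketch on a made-up four-row frame (hypothetical values, not rows from the abalone file). It assumes the `Sex` column takes the values F, I, and M, which is consistent with the `Sex_I` and `Sex_M` columns in the output above. Passing `dtype=int` is an addition here so the dummy columns print as 0/1 on recent pandas versions, where the default is boolean.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset has more columns and rows.
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Class": ["negative", "negative", "positive", "negative"],
})

# Target: 0 for the negative (majority) class, 1 for the positive class.
toy["Class"] = toy["Class"].map(lambda x: 0 if x == "negative" else 1)

# One-hot encode Sex; drop_first=True drops the first category (F),
# so a row with Sex_I == 0 and Sex_M == 0 means Sex was F.
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(encoded)
```

Dropping the first category avoids a redundant column: with three categories, two indicator columns are enough to recover the original value.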

# Share of each class in the target
df['Class'].value_counts(normalize=True)

# Bar chart of the class counts
df['Class'].value_counts().plot(kind='bar')

