Handling Imbalanced Datasets in scikit-learn: Techniques and Best Practices

Introduction to Imbalanced Datasets

Imbalanced datasets, where the number of samples varies greatly across classes, are common in machine learning applications. Typically there are far more samples of the majority class than of the minority class. For example, in fraud detection, legitimate transactions vastly outnumber fraudulent ones; in medical diagnosis, samples from healthy patients vastly outnumber those from patients with a particular disease. This imbalance can seriously degrade the performance of machine learning models.

Models trained on imbalanced data tend to be biased towards the majority class and to perform poorly on the minority class, which is often the class of interest. Metrics like accuracy can be misleading: a model can achieve very high accuracy simply by predicting the majority class every time.
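To make this concrete, here is a minimal sketch of the problem using a synthetic dataset (the 99:1 class split and the parameter values are assumptions chosen for illustration):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic binary dataset with roughly a 99:1 class imbalance
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=42)

# A baseline that always predicts the most frequent class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)

y_pred = dummy.predict(X)
print(accuracy_score(y, y_pred))  # ~0.99 despite learning nothing
print(f1_score(y, y_pred))        # 0.0, since no minority samples are predicted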

In this article, we will discuss techniques for handling imbalanced data with scikit-learn and the companion imbalanced-learn (imblearn) library, with the goal of improving metrics like precision, recall, F1-score, and ROC AUC. We will cover sampling techniques such as random oversampling and undersampling to balance the classes. We will also discuss cost-sensitive learning, which assigns asymmetric costs to errors on the different classes to counter the imbalance.

Finally, we will cover some best practices for evaluating models on skewed datasets, such as stratified splitting and appropriate evaluation metrics like precision, recall, and precision-recall curves.

Dealing with imbalanced data requires thoughtful data preprocessing, algorithm tweaking, and model evaluation. By using the right techniques, we can build models that perform well on imbalanced real-world data.

Oversampling the Minority Class

A simple way to balance an imbalanced dataset is to oversample the minority class by creating additional copies of its samples. The RandomOverSampler class from the imbalanced-learn library (imblearn), which is designed to work alongside scikit-learn, randomly duplicates minority class samples until the classes are balanced.

from imblearn.over_sampling import RandomOverSampler

# X is the feature matrix, y the class labels
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
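After resampling, it is worth verifying the new class distribution; a quick check (the counts shown are hypothetical) might look like this:

from collections import Counter

print(Counter(y))            # original, e.g. Counter({0: 9900, 1: 100})
print(Counter(y_resampled))  # balanced, e.g. Counter({0: 9900, 1: 9900})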

A more advanced oversampling technique is SMOTE, short for Synthetic Minority Over-sampling Technique. Instead of duplicating existing samples, SMOTE generates new synthetic minority samples by interpolating between a minority sample and one of its nearest minority class neighbors in feature space, which adds more variety to the training data.

from imblearn.over_sampling import SMOTE

# Synthesize new minority samples by interpolating between neighbors
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

Undersampling the Majority Class

Instead of oversampling the minority class, we can undersample the majority class to balance the class distribution. The RandomUnderSampler class randomly discards majority class samples until the classes are balanced. Note that this throws away potentially useful training data, which matters most when the dataset is small.

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority class samples to even out the class balance
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

A more targeted undersampling technique uses Tomek links. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; removing the majority class member of each pair cleans up the class boundary. This removes relatively few samples, so it is better suited to cleaning the data than to fully balancing it.

from imblearn.under_sampling import TomekLinks

# Remove the majority class sample from each Tomek link pair
tl = TomekLinks(sampling_strategy='majority')
X_resampled, y_resampled = tl.fit_resample(X, y)

Adjusting Class Weights

Another technique is to assign higher weights to minority class samples when training the model, so that errors on the minority class are penalized more heavily. Many scikit-learn estimators accept a class_weight parameter.

from sklearn.svm import SVC

# Weight errors on the minority class (label 1) ten times more heavily
# than errors on the majority class; the exact ratio is illustrative
class_weights = {0: 1.0,
                 1: 10.0}

svc = SVC(class_weight=class_weights)
svc.fit(X, y)

Alternatively, the class_weight='balanced' option sets the weights automatically, inversely proportional to the class frequencies.
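As a minimal sketch, the balanced option plugs in the same way; LogisticRegression here is just one example of an estimator that supports class_weight:

from sklearn.linear_model import LogisticRegression

# Each class weight is set to n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)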

Stratification

When splitting the data into training and validation sets, it is important to stratify the split, that is, to preserve the class proportions in both subsets. This can be done using scikit-learn's StratifiedShuffleSplit.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Indexing assumes X and y are NumPy arrays (use .iloc for pandas objects)
for train_index, val_index in split.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

Stratification should also be used when performing cross-validation, for example with StratifiedKFold, as sketched below.
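Here is a sketch of stratified cross-validation; the estimator and the F1 scoring choice are assumptions for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each fold preserves the original class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='f1')
print(scores.mean())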

Evaluation Metrics

When dealing with imbalanced classes, accuracy can be misleading. Precision, recall, and F1-score focus on performance for the minority (positive) class and give a much clearer picture. ROC AUC measures how well the model ranks positives above negatives across all thresholds, although under heavy imbalance the precision-recall curve is often more informative.
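A minimal sketch of computing these metrics, assuming clf is a fitted classifier that supports predict_proba and that X_val and y_val come from the stratified split above:

from sklearn.metrics import classification_report, roc_auc_score

y_pred = clf.predict(X_val)

# Per-class precision, recall, and F1-score
print(classification_report(y_val, y_pred))

# ROC AUC needs scores or probabilities rather than hard predictions
y_scores = clf.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, y_scores))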

Conclusion

Handling imbalanced datasets requires specialized techniques beyond a default scikit-learn workflow. Oversampling and undersampling, class weight adjustment, and imbalance-aware metrics like F1-score make it possible to build models that perform well on imbalanced classification tasks.

Key best practices include oversampling with SMOTE, stratifying train-test splits and cross-validation folds, and choosing evaluation metrics that suit the class distribution.