Data Preparation with Python: Dealing with Outliers

This is an excerpt from our latest ebook Data Cleaning and Preprocessing for Data Science Beginners.

Outliers are unusual observations that significantly differ from the rest of the data. While outliers can sometimes indicate important findings or errors in data collection, they can also skew the data and lead to misleading results. This chapter will provide an overview of outliers and their impact, discuss different outlier detection techniques, and present strategies for handling outliers with practical Python examples.

Understanding Outliers and Their Impact

Outliers arise for various reasons, such as measurement errors, data processing errors, or true anomalies (e.g., a major event disrupting the usual process). Understanding them is critical because their presence can substantially affect your data analysis. They can:

Affect Mean and Standard Deviation

Outliers can pull the mean toward themselves and inflate the standard deviation, so these summary statistics no longer describe the bulk of the data.

Impact Model Accuracy

Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times and less accurate models.
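As a minimal sketch of this sensitivity, the snippet below fits a straight line by ordinary least squares with `np.polyfit`, once on clean data and once with a single corrupted target value. The data and variable names are made up for illustration; the point is only that one extreme value can change the fitted slope dramatically.

```python
import numpy as np

# Hypothetical toy data: y grows by 10 for every unit step in x
x = np.array([1, 2, 3, 4, 5])
y_clean = np.array([10, 20, 30, 40, 50])
y_outlier = np.array([10, 20, 30, 40, 500])  # last target is an outlier

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f'Slope without outlier: {slope_clean:.1f}')
print(f'Slope with outlier: {slope_outlier:.1f}')
```

On this toy data, the single outlier moves the fitted slope from about 10 to about 100, so the line no longer describes the other four points.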

Let’s demonstrate how outliers can skew the mean using a simple Python example:

import numpy as np

# Regular data
regular_data = np.array([10, 20, 30, 40, 50])
print(f'Mean of regular data: {regular_data.mean()}')

# Data with an outlier
outlier_data = np.array([10, 20, 30, 40, 500])  # 500 is an outlier
print(f'Mean of data with an outlier: {outlier_data.mean()}')
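The same comparison can be extended to the standard deviation and the median. The sketch below is self-contained and uses the same two arrays; it suggests that the standard deviation is inflated along with the mean, while the median is unaffected by the single extreme value.

```python
import numpy as np

regular_data = np.array([10, 20, 30, 40, 50])
outlier_data = np.array([10, 20, 30, 40, 500])  # 500 is an outlier

for name, data in [('regular', regular_data), ('outlier', outlier_data)]:
    # np.std uses the population formula (ddof=0) by default
    print(f'{name}: mean={data.mean():.1f}, std={data.std():.1f}, median={np.median(data):.1f}')
```

Here the standard deviation grows from about 14 to about 190, while the median stays at 30. This is one reason the median is often preferred as a summary when outliers are suspected.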

Outlier Detection Techniques

Outlier detection can be performed using several methods, each with its advantages and limitations. Here are a few common ones:

Statistical Methods

Z-score: The Z-score is a measure of how many standard deviations an observation is from the mean. A common rule of thumb is that a data point with a Z-score greater than 3 or less than -3 is considered an outlier.

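from scipy import stats

# Absolute Z-score of each observation (population standard deviation by default)
z_scores = np.abs(stats.zscore(outlier_data))
# Caution: with only five points no |Z-score| can exceed 2, so this selection is empty here
outliers = outlier_data[z_scores > 3]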
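One caveat is worth checking in code. In a sample of n points, the largest possible absolute Z-score is sqrt(n - 1) when it is computed with the population standard deviation, which is the `stats.zscore` default. For the five-point array used earlier that bound is 2, so a threshold of 3 cannot flag anything. The sketch below illustrates this and then applies the same rule to a larger sample, where the value 500 is flagged. `larger_data` is a made-up array for illustration, not part of the original example.

```python
import numpy as np
from scipy import stats

small_data = np.array([10, 20, 30, 40, 500])
small_z = np.abs(stats.zscore(small_data))
print(f'Largest |z| in the 5-point sample: {small_z.max():.2f}')  # below 3
print(f'Flagged: {small_data[small_z > 3]}')  # empty

# Hypothetical larger sample: 21 typical values plus the same extreme value
larger_data = np.append(np.arange(10, 51, 2), 500)
larger_z = np.abs(stats.zscore(larger_data))
flagged = larger_data[larger_z > 3]
print(f'Flagged in the larger sample: {flagged}')
```

For very small samples, a rank-based rule such as the IQR method is usually a safer choice than a fixed Z-score threshold.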

IQR method: The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). The IQR method flags as outliers the data points that fall more than a chosen multiple of the IQR below Q1 or above Q3. A common multiple is 1.5.

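# First and third quartiles
Q1 = np.percentile(outlier_data, 25)
Q3 = np.percentile(outlier_data, 75)
IQR = Q3 - Q1

# Flag points more than 1.5 * IQR below Q1 or above Q3
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = outlier_data[(outlier_data < lower_bound) | (outlier_data > upper_bound)]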
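To make the rule reusable, the IQR steps can be wrapped in a small helper. `iqr_outliers` and its `factor` parameter are names chosen here for illustration, not part of NumPy; this is a sketch, assuming a one-dimensional numeric input.

```python
import numpy as np

def iqr_outliers(data, factor=1.5):
    """Return the values of `data` outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return data[(data < lower) | (data > upper)]

print(iqr_outliers([10, 20, 30, 40, 500]))  # 500 lies above Q3 + 1.5 * IQR
print(iqr_outliers([10, 20, 30, 40, 50]))   # nothing is flagged
```

Unlike the Z-score rule, this approach flags 500 even in the five-point sample, because quartiles are barely affected by a single extreme value.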

