Introduction
Evaluating the performance of machine learning models is a critical part of the model building process. For classification models, there are a number of important metrics that can be used to determine how well a model is performing.
In this article, we will focus on evaluating classification models in scikit-learn using some common evaluation metrics. Scikit-learn provides implementations of all the metrics we will cover, making it easy to calculate them as part of your model evaluation workflow.
Accuracy
Accuracy is the most basic performance metric for classification models. It is defined as the number of correct predictions divided by the total number of predictions. In scikit-learn, accuracy can be computed using the `accuracy_score` function:

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_true, y_pred)
```
Here, `y_true` contains the true target values and `y_pred` contains the labels predicted by your model.
Accuracy works best when the classes are roughly balanced and false positives and false negatives are equally undesirable. For imbalanced datasets, accuracy can fail to capture how well the model is actually performing.
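As a quick illustration of this pitfall (the labels and predictions below are made up for the example), a classifier that always predicts the majority class can still achieve high accuracy on an imbalanced dataset:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: nine negatives, one positive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A "model" that always predicts the majority class
y_pred = [0] * 10

# Prints 0.9 even though the positive class is never detected
print(accuracy_score(y_true, y_pred))
```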
Precision and Recall
Two other useful metrics are precision and recall. Precision refers to the proportion of positive predictions that are actually correct. Recall refers to the proportion of actual positive instances that are correctly predicted positive. They are defined as:
```
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
```
Scikit-learn provides the `precision_score` and `recall_score` functions to calculate these metrics:
```python
from sklearn.metrics import precision_score, recall_score

# By default, both functions assume a binary problem with 1 as the positive class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
```
For classification problems where false negatives or false positives are more detrimental, looking at precision and recall separately can provide a more nuanced view of performance than just accuracy.
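To make the distinction concrete, here is a small sketch using made-up labels and predictions; the counts from the confusion matrix map directly onto the formulas above:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, chosen only for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn)                       # 2 true positives, 1 false positive, 2 false negatives

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.50
```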
F1 Score
The F1 score combines precision and recall into a single metric by taking their harmonic mean. It provides a balance between precision and recall and is commonly used as a summary metric for classifiers, particularly when there is an uneven class distribution.
The F1 score is calculated as:
```
F1 = 2 * (precision * recall) / (precision + recall)
```
In scikit-learn, it can be computed using `f1_score`:
```python
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
```
The F1 score ranges from 0 to 1, with higher values indicating better performance. It is a useful metric for imbalanced datasets or when false positives and false negatives have different costs.
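The same made-up predictions from the previous sketch can be used to check that `f1_score` matches the harmonic-mean formula above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions, reused for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 0.67
recall = recall_score(y_true, y_pred)        # 0.50
manual_f1 = 2 * (precision * recall) / (precision + recall)

print(manual_f1)                 # 0.57
print(f1_score(y_true, y_pred))  # same value, computed directly
```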
ROC AUC
For binary classification problems, another common metric is the ROC AUC score. ROC stands for Receiver Operating Characteristic, and AUC stands for Area Under the Curve. The ROC curve is a plot of the true positive rate against the false positive rate as the classification threshold varies. The AUC score summarizes the curve as the total area underneath it. A perfect classifier will have an AUC of 1, while a purely random classifier will have an AUC of around 0.5.
Scikit-learn provides `roc_auc_score` to calculate this metric:
```python
from sklearn.metrics import roc_auc_score

# Pass probability estimates or decision scores for the positive class,
# not hard class labels
roc_auc = roc_auc_score(y_true, y_score)
```
ROC AUC is useful when you need to evaluate and compare multiple binary classifiers. It captures how well the models are able to distinguish the positive and negative classes overall.
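As an end-to-end sketch (the synthetic dataset and logistic regression model here are stand-ins chosen purely for illustration), a typical pattern is to fit a classifier and pass the predicted probabilities for the positive class to `roc_auc_score`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary classification data for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability estimates for the positive class, not hard labels
y_score = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))
```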
Conclusion
The table below summarizes the metrics covered in this article:

| Metric | Description | Best Use Case |
| --- | --- | --- |
| Accuracy | Number of correct predictions divided by the total number of predictions | Balanced datasets where false positives and false negatives are equally undesirable |
| Precision | Proportion of positive predictions that are actually correct | Problems where false positives are more detrimental |
| Recall | Proportion of actual positive instances that are correctly predicted positive | Problems where false negatives are more detrimental |
| F1 Score | Harmonic mean of precision and recall, balancing the two | Imbalanced datasets or when false positives and false negatives have different costs |
| ROC AUC | Area under the ROC curve, indicating how well the model distinguishes the positive and negative classes | Evaluating and comparing multiple binary classifiers |
Evaluating classification models properly is critical for understanding their performance characteristics and selecting the right model for your application. Scikit-learn provides many useful metrics like accuracy, precision/recall, F1 score, and ROC AUC to quantify model performance on classification problems.
Consider what aspects of performance are most important for your particular use case, and select evaluation metrics accordingly. Tracking these metrics during model development can help you iteratively improve your models.