Reputation: 380
I am using sklearn's confusion_matrix to plot the results, together with the accuracy, recall and precision scores etc., and the graph renders as it should. However, I am slightly confused about what the different values of the normalize parameter mean. Why do we normalize, and what are the differences between the three options? Quoting from the documentation:
normalize{‘true’, ‘pred’, ‘all’}, default=None
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population.
If None, confusion matrix will not be normalized.
Does it normalize the values to a percentage format to make them easier to read visually when datasets are large? Or am I missing the point altogether here? I have searched, but the questions all appear to describe how to do it, rather than the meaning behind the options.
Upvotes: 2
Views: 5702
Reputation: 4629
A normalized version makes it easier to visually interpret how the labels are being predicted, and also to perform comparisons. You can also pass values_format='.0%' to display the values as percentages. The normalize parameter specifies what the denominator should be.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import train_test_split
# Generate some example data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)
# Train the classifier on the training split only
clf = LogisticRegression()
clf.fit(X_train, y_train)
plot_confusion_matrix(clf, X_test, y_test); plt.title("Not normalized");
plot_confusion_matrix(clf, X_test, y_test, values_format= '.0%', normalize='true'); plt.title("normalize='true'");
plot_confusion_matrix(clf, X_test, y_test, values_format= '.0%', normalize='pred'); plt.title("normalize='pred'");
plot_confusion_matrix(clf, X_test, y_test, values_format= '.0%', normalize='all'); plt.title("normalize='all'");
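The percentages in the normalized plots should match what confusion_matrix itself returns for the corresponding normalize argument. A minimal sketch (reusing the clf, X_test and y_test defined above):
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))                    # raw counts
print(confusion_matrix(y_test, y_pred, normalize='true'))  # each row sums to 1
print(confusion_matrix(y_test, y_pred, normalize='pred'))  # each column sums to 1
print(confusion_matrix(y_test, y_pred, normalize='all'))   # whole matrix sums to 1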
Upvotes: 2
Reputation: 19179
If you repeat an experiment with different sample sizes, you may want to compare confusion matrices across experiments. To do so, you wouldn't want to see the total counts for each matrix. Instead, you would want to see the counts normalized, but you need to decide whether the terms should be normalized by the total number of samples ("all"), by the predicted class counts ("pred"), or by the true class counts ("true"). For example:
In [30]: yt
Out[30]: array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])
In [31]: yp
Out[31]: array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
In [32]: confusion_matrix(yt, yp)
Out[32]:
array([[4, 3],
[3, 0]])
In [33]: confusion_matrix(yt, yp, normalize='pred')
Out[33]:
array([[0.57142857, 1. ],
[0.42857143, 0. ]])
In [34]: confusion_matrix(yt, yp, normalize='true')
Out[34]:
array([[0.57142857, 0.42857143],
[1. , 0. ]])
In [35]: confusion_matrix(yt, yp, normalize='all')
Out[35]:
array([[0.4, 0.3],
[0.3, 0. ]])
Upvotes: 1
Reputation: 10545
Yes, you can think of it as a percentage. The default is to just show the absolute count in each cell of the confusion matrix, i.e. how often each combination of true and predicted category levels occurs.
But if you choose e.g. normalize='all', every count is divided by the sum of all counts, so that you have relative frequencies whose sum over the whole matrix is 1. Similarly, if you pick normalize='true', you get relative frequencies per row, so each row sums to 1.
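A quick, illustrative check (the labels here are made up): the rows sum to 1 with normalize='true' and the whole matrix sums to 1 with normalize='all'.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred, normalize='true').sum(axis=1))  # -> [1. 1.]
print(confusion_matrix(y_true, y_pred, normalize='all').sum())         # -> 1.0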
Upvotes: 1