Reputation: 3619
I calculated a confusion matrix for my classifier using confusion_matrix()
from scikit-learn. The diagonal elements of the confusion matrix represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.
I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1; that way I can read the percentage of correctly classified samples off the matrix.
I found several methods for normalizing a matrix (row and column normalization), but I don't know much about maths and am not sure whether this is the correct approach.
Upvotes: 62
Views: 110959
Reputation: 5470
Using Seaborn you can easily print a normalised AND pretty confusion matrix with a heatmap:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
# Normalise each row so it sums to 1
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)
Upvotes: 46
Reputation: 4100
For the case where your confusion matrix already includes the totals, something like this:
0 1 2 Total
0 5434084 567 3460 5438111
1 458896 4717484 115297 5291677
2 189553 8305 13962602 14160460
Total 6082533 4726356 14081359 24890248
My solution was:
# cm here is a pandas DataFrame; divide each row by its class total
cm = (cm.astype('float').T / cm.drop('Total', axis=1).sum(axis=1)).T
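A minimal sketch of the whole setup, assuming cm is a pandas DataFrame laid out like the table above (the construction below is illustrative):
import pandas as pd

# The confusion matrix from the table above, with marginal totals
cm = pd.DataFrame([[5434084, 567, 3460],
                   [458896, 4717484, 115297],
                   [189553, 8305, 13962602]],
                  index=[0, 1, 2], columns=[0, 1, 2])
cm['Total'] = cm.sum(axis=1)
cm.loc['Total'] = cm.sum(axis=0)

# Row-normalise: divide each row (totals included) by its class total
cmn = (cm.astype('float').T / cm.drop('Total', axis=1).sum(axis=1)).T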
Upvotes: 0
Reputation: 60369
Nowadays, scikit-learn's confusion matrix comes with a normalize
argument; from the docs:
normalize : {'true', 'pred', 'all'}, default=None
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.
So, if you want the values normalized over all samples, you should use
confusion_matrix(y_true, y_pred, normalize='all')
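For completeness, all three options side by side, on toy labels (the data below is just for illustration):
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 0, 1, 2, 2, 1]

# Each row sums to 1 (per-class recall on the diagonal)
confusion_matrix(y_true, y_pred, normalize='true')

# Each column sums to 1 (per-class precision on the diagonal)
confusion_matrix(y_true, y_pred, normalize='pred')

# All entries sum to 1
confusion_matrix(y_true, y_pred, normalize='all')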
Upvotes: 30
Reputation: 2206
I think the easiest way to do this is by doing:
import sklearn.metrics

c = sklearn.metrics.confusion_matrix(y, y_pred)
# Divide each row by its sum (np.float is removed in recent NumPy; plain float works)
normed_c = (c.T / c.astype(float).sum(axis=1)).T
Upvotes: 5
Reputation: 2485
There's a third-party plotting library, scikit-plot, built on top of scikit-learn and matplotlib; matplotlib must already be installed to proceed further.
pip install scikit-plot
Now, just set the normalize parameter to True:
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(Y_TRUE, Y_PRED, normalize=True)
Upvotes: 3
Reputation: 2622
From the sklearn documentation (plot example):
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
where cm is the confusion matrix as provided by sklearn.
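In context (imports assumed; y_true and y_pred stand in for your labels and predictions):
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
# Divide each row by its sum so every row sums to 1
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]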
Upvotes: 27
Reputation: 363737
Suppose that
>>> y_true = [0, 0, 1, 1, 2, 0, 1]
>>> y_pred = [0, 1, 0, 1, 2, 2, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> C
array([[1, 1, 1],
[1, 2, 0],
[0, 0, 1]])
Then, to find out how many samples per class have received their correct label, you need
>>> C / C.astype(float).sum(axis=1)
array([[ 0.33333333, 0.33333333, 1. ],
[ 0.33333333, 0.66666667, 0. ],
[ 0. , 0. , 1. ]])
The diagonal contains the required values. (Note that, because of broadcasting, this expression divides each column by the row sums, so only the diagonal entries are row-normalized; for a fully row-normalized matrix, divide by C.sum(axis=1, keepdims=True) instead.) Another way to compute these is to realize that what you're computing is the recall per class:
>>> from sklearn.metrics import precision_recall_fscore_support
>>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> recall
array([ 0.33333333, 0.66666667, 1. ])
Similarly, if you divide by the sum over axis=0, you get the precision (the fraction of class-k predictions that have ground truth label k):
>>> C / C.astype(float).sum(axis=0)
array([[ 0.5 , 0.33333333, 0.5 ],
[ 0.5 , 0.66666667, 0. ],
[ 0. , 0. , 0.5 ]])
>>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> prec
array([ 0.5 , 0.66666667, 0.5 ])
Upvotes: 45
Reputation: 59250
The matrix output by sklearn's confusion_matrix() is such that C[i, j] is equal to the number of observations known to be in group i but predicted to be in group j,
so to get the percentages for each class (often called specificity and sensitivity in binary classification) you need to normalize by row: replace each element in a row by itself divided by the sum of the elements of that row.
Note that sklearn has a summary function available that computes metrics from the confusion matrix: classification_report. It outputs precision and recall rather than specificity and sensitivity, but those are often regarded as more informative in general (especially for imbalanced multi-class classification).
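For example (the labels below are just for illustration):
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 0, 1, 2, 2, 1]

# Prints per-class precision, recall, f1-score and support
print(classification_report(y_true, y_pred))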
Upvotes: 10
Reputation: 69954
I'm assuming that M[i, j] stands for "an element of real class i was classified as j". If it's the other way around, you will need to transpose everything I say. I'm also going to use the following matrix for concrete examples:
1 2 3
4 5 6
7 8 9
There are essentially two things you can do:
The first thing you can ask is what percentage of elements of real class i were classified as each class. To do so, we take the row i and divide each element by the sum of the elements in that row. In our example, objects from class 2 are classified as class 1 four times, classified correctly as class 2 five times, and classified as class 3 six times. To find the percentages we just divide everything by the sum 4 + 5 + 6 = 15:
4/15 of the class 2 objects are classified as class 1
5/15 of the class 2 objects are classified as class 2
6/15 of the class 2 objects are classified as class 3
The second thing you can do is to look at each result from your classifier and ask how many of those results originate from each real class. It's going to be similar to the other case, but with columns instead of rows. In our example, our classifier returns "1" one time when the original class is 1, four times when the original class is 2, and seven times when the original class is 3. To find the percentages we divide by the sum 1 + 4 + 7 = 12:
1/12 of the objects classified as class 1 were from class 1
4/12 of the objects classified as class 1 were from class 2
7/12 of the objects classified as class 1 were from class 3
--
Of course, both of the methods I gave apply to only a single row or column at a time, and I'm not sure if it would be a good idea to actually modify your confusion matrix in this form. However, this should give the percentages you are looking for.
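For reference, a short NumPy sketch that applies each normalization to the whole example matrix at once (the variable names are mine):
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Row-normalize: each row sums to 1 (how real class i was distributed over predictions)
row_norm = M / M.sum(axis=1, keepdims=True)

# Column-normalize: each column sums to 1 (which real classes each prediction came from)
col_norm = M / M.sum(axis=0, keepdims=True)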
Upvotes: 16