Widhi

Reputation: 39

Confusion matrix returns a single-element matrix

I ran into an issue with scikit-learn's confusion matrix.

I use confusion_matrix inside a KFold loop. When y_true and y_pred are 100% correct (and the fold contains only one class), confusion_matrix returns a single number instead of a 2x2 matrix. This breaks my confusion-matrix variable, because I add the result from confusion_matrix in each fold. Does anyone have a solution for this?

Here is my code

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
kf = KFold(n_splits=10)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    cf += confusion_matrix(y_test, y_pred)  # breaks when a fold has only one class
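For reference, a minimal reproduction of the behaviour with spoofed labels: when a fold contains only one class and the predictions match it, confusion_matrix infers a single label and returns a (1, 1) matrix, which cannot be added to a (2, 2) accumulator.

```python
from sklearn.metrics import confusion_matrix

# both classes appear in the fold: shape is (2, 2) as expected
cm = confusion_matrix([0, 1, 1], [0, 1, 1])
print(cm.shape)  # (2, 2)

# only one class appears (and is predicted): shape collapses to (1, 1)
cm = confusion_matrix([1, 1, 1], [1, 1, 1])
print(cm.shape)  # (1, 1)
```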

Thank You

Upvotes: 0

Views: 3093

Answers (2)

Bonlenfum

Reputation: 20195

The cleanest way is probably to pass a list of all possible classes as the labels argument. Here is an example that reproduces the issue and shows it being resolved (based on spoofed data for the truth and predictions).

from sklearn.metrics import confusion_matrix
import numpy as np

y_test = np.array([1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0, 0])

labels = np.unique(y_test)

cf = np.array([[0, 0], [0, 0]])

for indices in [[0, 1, 2, 3], [1, 2, 3], [1, 2, 3, 4, 5, 6]]:
    cm1 = confusion_matrix(y_test[indices], y_pred[indices])
    cm2 = confusion_matrix(y_test[indices], y_pred[indices], labels=labels)
    print(cm1.shape == (2, 2), cm2.shape == (2, 2))

In the first subset, both classes appear; but in the second subset only one class appears, so cm1 is not of size (2, 2) (it comes out as (1, 1)). Note that by listing all potential classes in labels, cm2 always has the right shape.

If you already know that the labels can only be 0 or 1, you could just pass labels=[0, 1], but using np.unique is more robust.
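Applied to the question's loop, the fix is a one-line change. A sketch with spoofed x/y data standing in for the asker's dataset:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

# spoofed, linearly separable data (30 samples, 2 classes)
x = np.array([[3, 0], [4, 0], [5, 1], [0, 3], [0, 4], [1, 5]] * 5)
y = np.array([1, 1, 1, 0, 0, 0] * 5)

labels = np.unique(y)  # fix the class set once, from the full y
model = MultinomialNB()
kf = KFold(n_splits=10)
cf = np.zeros((labels.size, labels.size), dtype=int)
for train_index, test_index in kf.split(x):
    model.fit(x[train_index], y[train_index])
    y_pred = model.predict(x[test_index])
    # labels= keeps every fold's matrix at the same (2, 2) shape,
    # even when a fold contains only one class
    cf += confusion_matrix(y[test_index], y_pred, labels=labels)
print(cf)
```

Each fold of this data contains samples of a single class, so without labels= the accumulation would fail exactly as described in the question.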

Upvotes: 2

J. Doe

Reputation: 3634

You can first check whether all predicted values equal the true values. If so, just increment the [0][0] and [1][1] entries of your confusion matrix by the counts of 0s and 1s in the predictions.

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

X = pd.DataFrame({'f1': [1]*10 + [0]*10,
                  'f2': [3]*10 + [10]*10}).values
y = np.array([1]*10 + [0]*10)
model = MultinomialNB()
kf = KFold(n_splits=5)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    if all(y_test == y_pred):  # perfect prediction on this fold
        cf[0][0] += sum(y_pred == 0)  # count of correct 0s
        cf[1][1] += sum(y_pred == 1)  # count of correct 1s
    else:
        cf += confusion_matrix(y_test, y_pred)  # otherwise add as usual

Result of print(cf):

>> [[10  0]
    [ 0 10]]

Be careful about overfitting, though.

Upvotes: 0
