Reputation: 39
I found an issue with scikit-learn's confusion matrix.
I call confusion_matrix inside a KFold loop, and when y_true and y_pred agree 100%, confusion_matrix returns a single number (a 1x1 matrix). This breaks my accumulator variable, because I add the result from confusion_matrix in each fold. Does anyone have a solution for this?
Here is my code
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
kf = KFold(n_splits=10)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    cf += confusion_matrix(y_test, y_pred)
Thank You
Upvotes: 0
Views: 3093
Reputation: 20195
The cleanest way is probably to pass a list of all possible classes as the labels argument. Here is an example that shows the issue and its resolution (based on spoofed data for the truth and predictions).
from sklearn.metrics import confusion_matrix
import numpy as np

y_test = np.array([1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0, 0])
labels = np.unique(y_test)

for indices in [[0, 1, 2, 3], [1, 2, 3], [1, 2, 3, 4, 5, 6]]:
    cm1 = confusion_matrix(y_test[indices], y_pred[indices])
    cm2 = confusion_matrix(y_test[indices], y_pred[indices], labels=labels)
    print(cm1.shape == (2, 2), cm2.shape == (2, 2))
In the first subset, both classes appear; but in the second subset, only one class appears, so cm1 is not of size (2,2) (it comes out as (1,1)). But note that by indicating all potential classes via labels, cm2 always has the full (2,2) shape.
If you already know that the labels can only be 0 or 1, you could just assign labels=[0,1], but using np.unique will be more robust.
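Applied to the loop from the question, this amounts to computing the label set once from the full target array and passing it to every confusion_matrix call. A minimal sketch, using made-up toy data since the question's x and y are not shown:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the question's x and y (assumption)
rng = np.random.RandomState(0)
x = rng.randint(0, 5, size=(20, 3))
y = np.array([0, 1] * 10)

labels = np.unique(y)  # all classes, computed once from the full target
model = MultinomialNB()
kf = KFold(n_splits=10)
cf = np.zeros((len(labels), len(labels)), dtype=int)
for train_index, test_index in kf.split(x):
    model.fit(x[train_index], y[train_index])
    y_pred = model.predict(x[test_index])
    # labels= forces a fixed 2x2 shape even when a fold contains one class
    cf += confusion_matrix(y[test_index], y_pred, labels=labels)
print(cf)
```

Because every per-fold matrix now has the same shape, the += accumulation never fails, regardless of which classes happen to land in a fold.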
Upvotes: 2
Reputation: 3634
You can first check whether all pred_values are equal to true_values. If that is the case, just increment the [0][0] and [1][1] cells of your confusion matrix by the number of 0s and 1s in the predictions (or, equivalently, the true values).
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

X = pd.DataFrame({'f1': [1]*10 + [0]*10,
                  'f2': [3]*10 + [10]*10}).values
y = np.array([1]*10 + [0]*10)

model = MultinomialNB()
kf = KFold(n_splits=5)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    if all(y_test == y_pred):  # perfect prediction in this fold
        cf[0][0] += sum(y_pred == 0)  # increment by number of 0 values
        cf[1][1] += sum(y_pred == 1)  # increment by number of 1 values
    else:
        cf += confusion_matrix(y_test, y_pred)  # otherwise add the fold's matrix
Result of print(cf):
[[10  0]
 [ 0 10]]
Be careful of overfitting.
Upvotes: 0