Reputation: 1
I have a question about confusion matrices. I use cross-validation to split 148 instances into two arrays, test and train. Then I call something like this:
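from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

def GenerateResult(x_train, x_test, y_train, y_test):
    clf = OneVsRestClassifier(GaussianNB())
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)
    confusion_mtrx = confusion_matrix(y_test, predictions)
    # Return both metrics so the caller can unpack them
    return accuracy, confusion_mtrx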
This is the KFold loop, which calls the function above:
for train_idx, test_idx in pf.split(x_array):
    x_train, x_test = x_array[train_idx], x_array[test_idx]
    y_train, y_test = y_array[train_idx], y_array[test_idx]
    acc, confusion = GenerateResult(x_train, x_test, y_train, y_test)
    results['First'].append(acc)
    confusion_dict['First'].append(confusion)
Then I collect the results and calculate the mean:
np_gausian = np.asarray(results['gaussian'])
print("[First] Mean: {}".format(np.mean(np_gausian)))
print(confusion_dict['gaussian'])
Here is my problem. My 148 instances have 4 output classes, but when I run the KFold loop I get two confusion matrices of different sizes. The first confusion matrix is 3x3:
[[36 1 1]
[15 17 1]
[ 0 0 3]]
The second is 4x4:
[[ 0 2 0 0]
[ 0 41 2 0]
[ 0 12 16 0]
[ 0 0 1 0]]
I think the problem arises because my 148 instances are distributed across the classes like this:
Class 1 - 2 ea
Class 2 - 81 ea
Class 3 - 61 ea
Class 4 - 4 ea
All Class - 148
What should I do about this? How can I sum these confusion matrices? What happens if I change the number of splits in KFold? I tried to use Pandas, but I have no idea how to do it. Please help; I am using scikit-learn for this.
Upvotes: 0
Views: 1663
Reputation: 36617
As noted in the comment by @KRKirov, this happens because KFold cross-validation splits the data into folds, and some classes may be absent from the test set of a given fold.
For example, if class 1 is not present in y_test and is not predicted in predictions either, then confusion_matrix will automatically infer that there are only three classes in the data and generate the matrix accordingly.
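A minimal sketch of this behaviour, using made-up label arrays (the values are illustrative assumptions, not the asker's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical fold: class 1 never appears in y_test or in predictions.
y_test = np.array([2, 2, 3, 3, 4])
predictions = np.array([2, 3, 3, 3, 4])

# Without `labels`, only the classes seen in either array are used,
# so the result has one row and column per observed class.
inferred = confusion_matrix(y_test, predictions)
print(inferred.shape)  # (3, 3)
```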
You can force confusion_matrix to use all classes by setting the labels parameter:
labels : array, shape = [n_classes], optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.
by doing this:
confusion_mtrx = confusion_matrix(y_test, predictions,
                                  labels=np.unique(y_array))
You need to pass y_array, or the unique labels from y_array, to the GenerateResult() function.
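Once every fold produces a matrix of the same shape, the per-fold matrices can simply be added element-wise. The following is a rough sketch under stated assumptions, not the asker's actual code: the data is synthetic (only the class counts are taken from the question), generate_result is a hypothetical variant of GenerateResult() that takes an extra labels argument, and the shuffle and random_state settings are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB


def generate_result(x_train, x_test, y_train, y_test, labels):
    """Fit on one fold and return a confusion matrix with a fixed label order."""
    clf = OneVsRestClassifier(GaussianNB())
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    # Passing `labels` keeps the matrix at len(labels) x len(labels) in every fold.
    return confusion_matrix(y_test, predictions, labels=labels)


# Synthetic stand-in data (an assumption): same class counts as in the question.
rng = np.random.default_rng(0)
y_array = np.array([1] * 2 + [2] * 81 + [3] * 61 + [4] * 4)
x_array = rng.normal(size=(len(y_array), 3)) + y_array[:, None]

labels = np.unique(y_array)  # all four classes, in sorted order
kf = KFold(n_splits=2, shuffle=True, random_state=0)

# Every fold yields a 4x4 matrix, so the matrices can be added element-wise.
total = np.zeros((len(labels), len(labels)), dtype=np.int64)
for train_idx, test_idx in kf.split(x_array):
    total += generate_result(
        x_array[train_idx], x_array[test_idx],
        y_array[train_idx], y_array[test_idx],
        labels,
    )

print(total)
print(total.sum())  # 148: each instance lands in exactly one test fold
```

Changing n_splits only changes how many matrices are added together; with labels fixed, the shape stays the same, and the row sums of the total equal the true class counts.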
Upvotes: 0