Zarobiek

Reputation: 1

Python - difference in confusion matrix dimension

I have a question about confusion matrices. I use cross-validation to split 148 instances into two arrays, test and train. Then I call something like this:

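from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

def GenerateResult(x_train, x_test, y_train, y_test):
    clf = OneVsRestClassifier(GaussianNB())
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)
    confusion_mtrx = confusion_matrix(y_test, predictions)
    # Return both values, since the calling loop unpacks two results
    return accuracy, confusion_mtrx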

This is the KFold loop, which calls the function above:

for train_idx, test_idx in pf.split(x_array):
    x_train, x_test = x_array[train_idx], x_array[test_idx]
    y_train, y_test = y_array[train_idx], y_array[test_idx]
    acc, confusion = GenerateResult(x_train, x_test, y_train, y_test)
    results['First'].append(acc)
    confusion_dict['First'].append(confusion)

Then I collect the results and calculate the mean:

np_first = np.asarray(results['First'])
print("[First] Mean: {}".format(np.mean(np_first)))

print(confusion_dict['First'])

Here is my problem. My 148 instances have 4 output classes, but when I run that KFold loop I get confusion matrices of two different sizes. The first confusion matrix is 3x3:

[[36  1  1]
 [15 17  1]
 [ 0  0  3]]

The second is 4x4:

[[ 0  2  0  0]
 [ 0 41  2  0]
 [ 0 12 16  0]
 [ 0  0  1  0]]

I think I have a problem with it because in my 148 instances I have

What should I do about it? How can I sum these confusion matrices? What happens if I change the number of splits in KFold? I tried to use Pandas but I have no idea how to do it. Please help. I am using scikit-learn for this.

Upvotes: 0

Views: 1663

Answers (1)

Vivek Kumar

Reputation: 36617

As noted in a comment by @KRKirov, this happens because KFold cross-validation splits the data into folds, and some classes may not be present in the test set of a given fold.

For example, if class1 is absent from y_test and is also never predicted in predictions, then confusion_matrix automatically infers that there are only three classes in the data and builds the matrix accordingly.
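To see this inference in isolation, here is a minimal sketch. The label values and arrays are made up for illustration (they are not the asker's data): there are four classes overall, but class 0 never shows up in this hypothetical fold.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical fold (assumption): the full data has classes 0, 1, 2, 3,
# but class 0 appears neither in y_test nor in the predictions.
y_test = np.array([1, 1, 2, 2, 3])
predictions = np.array([1, 2, 2, 2, 3])

# With no labels argument, only the labels seen in either array are used.
inferred = confusion_matrix(y_test, predictions)
print(inferred.shape)  # 3x3 here, even though the full data has 4 classes
```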

You can force confusion_matrix to use all classes by setting the labels parameter:

labels : array, shape = [n_classes], optional

List of labels to index the matrix. This may be used to reorder or
select a subset of labels. If none is given, those that appear at
least once in y_true or y_pred are used in sorted order.

by doing this:

confusion_mtrx = confusion_matrix(y_test, predictions, 
                                 labels = np.unique(y_array))

To do that, you need to pass y_array, or the unique labels from y_array, to the GenerateResult() function.
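Once every fold produces a matrix of the same shape, the matrices can be summed element-wise with NumPy. The following is only a sketch under stated assumptions: the data is synthetic stand-in data (148 samples, 4 classes, two of them rare), and the names generate_result, all_labels, kf and total_confusion are illustrative rather than taken from the question.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB


def generate_result(x_train, x_test, y_train, y_test, labels):
    """Fit on one fold and return (accuracy, confusion matrix over all labels)."""
    clf = OneVsRestClassifier(GaussianNB())
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)
    # Passing labels fixes the matrix shape regardless of what this fold contains
    matrix = confusion_matrix(y_test, predictions, labels=labels)
    return accuracy, matrix


# Synthetic stand-in data (assumption, not the asker's dataset)
rng = np.random.RandomState(0)
y_array = np.array([0] * 3 + [1] * 80 + [2] * 60 + [3] * 5)
x_array = rng.normal(size=(148, 5)) + y_array[:, None]

all_labels = np.unique(y_array)
kf = KFold(n_splits=2, shuffle=True, random_state=0)

accuracies = []
total_confusion = np.zeros((len(all_labels), len(all_labels)), dtype=np.int64)
for train_idx, test_idx in kf.split(x_array):
    acc, matrix = generate_result(
        x_array[train_idx], x_array[test_idx],
        y_array[train_idx], y_array[test_idx],
        all_labels,
    )
    accuracies.append(acc)
    total_confusion += matrix  # same shape every fold, so this is safe

print("Mean accuracy: {}".format(np.mean(accuracies)))
print(total_confusion)
```

Because each sample lands in exactly one test fold, the entries of the summed matrix add up to the number of instances, whatever value n_splits takes. If the per-fold matrices are already stored in a list, np.sum(list_of_matrices, axis=0) gives the same total.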

Upvotes: 0
