Cross validation inconsistent numbers of samples error (Python)

Question

I am trying to classify data using cross-validation and an SVM classifier. In my data file, the last column contains the class labels (0, 1, 2, 3, 4, 5), and the remaining columns (except the first) hold the numeric features I want to use to predict them.

from sklearn import svm
from sklearn import metrics
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score


filename = "Features.csv"
dataset = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=range(1, 39))

x = dataset[:, 0:36]
y = dataset[:, 36]
print("len(x): " + str(len(x)))
print("len(y): " + str(len(y)))

skf = StratifiedKFold(n_splits=10, shuffle=False, random_state=42)

modelsvm = svm.SVC()

expected = y
print("len(expected): " + str(len(expected)))

predictedsvm = cross_val_score(modelsvm, x, y, cv=skf)
print("len(predictedsvm): " + str(len(predictedsvm)))

svm_results = metrics.classification_report(expected, predictedsvm)

print(svm_results)

This produces the following output and error:

len(x): 2069
len(y): 2069
len(expected): 2069
C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
len(predictedsvm): 10
Traceback (most recent call last):
  File "C:/Users/MyComp/PycharmProjects/GG/AR.py", line 54, in <module>
    svm_results = metrics.classification_report(expected, predictedsvm)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Python\Python37\lib\site-packages\sklearn\metrics\_classification.py", line 1929, in classification_report
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Python\Python37\lib\site-packages\sklearn\metrics\_classification.py", line 81, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 257, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [2069, 10]

Process finished with exit code 1

I don't understand how the number of samples in y drops to 10 when I try to predict it using cross-validation.

Can anyone help me with this, please?

Nick Becker · Accepted Answer

You are misunderstanding the output of cross_val_score. Per the documentation, it returns an "array of scores of the estimator for each run of the cross validation," not the actual predictions. Because you have 10 folds, you get 10 values.
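
To see the mismatch directly, here is a minimal sketch. It assumes a synthetic dataset from make_classification as a stand-in for Features.csv (the sample count of 300 is arbitrary, not the asker's 2069), and it only checks the shape of what cross_val_score returns:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for Features.csv (assumption): 300 samples, 36 features, 6 classes.
x, y = make_classification(n_samples=300, n_features=36, n_informative=10,
                           n_classes=6, random_state=0)

skf = StratifiedKFold(n_splits=10)

# cross_val_score fits and scores the model once per fold.
scores = cross_val_score(svm.SVC(), x, y, cv=skf)

print(scores.shape)  # one score per fold, not one value per sample
print(len(y))        # number of samples
```

The first length matches n_splits and the second matches the number of rows, which is why classification_report rejects the pair.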

classification_report expects the true labels and the predicted labels. To use it, you need predictions from a fitted model, so you first have to fit the model on the data. If you're happy with the scores from cross_val_score, you can train that model on the data and predict with it. Alternatively, you can use GridSearchCV to do all of this in one sweep.
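
Both routes could look roughly like the sketch below. The synthetic data, the held-out test split, and the param_grid values are assumptions made for illustration. None of them come from the question.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for Features.csv (assumption, not the asker's data).
x, y = make_classification(n_samples=300, n_features=36, n_informative=10,
                           n_classes=6, random_state=0)

# Hold out a test set so the report is computed on samples the model has not seen.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=0)

# Option 1: fit a single model, then predict one label per test sample.
model = svm.SVC()
model.fit(x_train, y_train)
predicted = model.predict(x_test)
print(metrics.classification_report(y_test, predicted))

# Option 2: GridSearchCV cross-validates every candidate and then refits the
# best one on the whole training set (refit=True is the default).
param_grid = {"C": [0.1, 1, 10]}  # illustrative values only
search = GridSearchCV(svm.SVC(), param_grid, cv=StratifiedKFold(n_splits=5))
search.fit(x_train, y_train)
predicted_gs = search.predict(x_test)
print(metrics.classification_report(y_test, predicted_gs))
```

In both cases the predictions have the same length as y_test, so classification_report accepts them. If the goal is a report over every sample in the original data, sklearn.model_selection.cross_val_predict is another option: it returns one out-of-fold prediction per sample and takes the same cv argument.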
