Rodrigo Laguna

Reputation: 1850

ValueError: Found input variables with inconsistent numbers of samples

There are many examples of this error in which the problem is related to the dimensions of an array or to how a dataframe is read. However, I'm using plain Python lists for both X and y.

I'm trying to split my data into train and test sets with train_test_split.

My code is this:

X, y = file2vector(corpus_dir)
assert len(X) == len(y) # both lists same length
print(type(X))
print(type(y))
seed = 123
labels = list(set(y))
print(len(labels))
print(labels)
cont = {}
for l in y:
    if not l in cont:
        cont[l] = 1
    else:
        cont[l] += 1

print(cont)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=labels)

Output is:

<class 'list'> # type(X)
<class 'list'> # type(y)
2 # len(labels)
['I', 'Z'] # labels
{'I': 18867, 'Z': 13009} # cont

X and y are just Python lists of Python strings that I read from a file with file2vector. I'm running Python 3, and the traceback is as follows:

Traceback (most recent call last):
  File "/home/rodrigo/idatha/no_version/imm/classifier.py", line 28, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=labels)
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 2056, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 1203, in split
    X, y, groups = indexable(X, y, groups)
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 229, in indexable
    check_consistent_length(*result)
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [31876, 2]

Upvotes: 2

Views: 19076

Answers (2)

Ilona Brinkmeier

Reputation: 11

Working with the Figure 8 imbalanced disaster messages dataset on Python 3.7 and scikit-learn 0.21.2, I had the same problem with train_test_split even with stratify=y. For me, the solution was to pass stratify=y.iloc[:, 1], having previously set y = df[df.columns[4:]], so that stratification uses a single label column rather than the whole multi-column frame. Perhaps this helps others too.
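One way to read this answer is sketched below. The DataFrame, its column names, and its values are made up for illustration and are not the actual disaster messages data; the only point is that stratify receives one label column with as many rows as X.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical multi-label frame; column names and values are assumptions.
df = pd.DataFrame({
    "message": ["msg %d" % i for i in range(40)],
    "related": [1, 0] * 20,
    "request": [0, 0, 0, 1] * 10,
})

X = df["message"]
y = df[["related", "request"]]  # several label columns, like df[df.columns[4:]]

# Stratify on a single label column (here the second one) instead of all of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y.iloc[:, 1]
)

# The share of positive "request" labels is preserved in both splits.
print(y_train["request"].mean(), y_test["request"].mean())
```

Which column to stratify on is a judgment call; the sketch simply mirrors the y.iloc[:, 1] choice from the answer.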

Upvotes: 0

Grr

Reputation: 16079

The issue is with your labels list. Internally, when stratify is provided to train_test_split, the value is passed as the y argument to the split method of a StratifiedShuffleSplit instance. As the documentation for the split method shows, y must be the same length as X (in this case, the arrays you wish to split). Your labels list holds only the 2 unique classes while X has 31876 items, which is exactly the [31876, 2] reported in the error. So to fix your problem, pass stratify=y instead of stratify=labels.
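A minimal sketch of the difference, using small made-up lists in place of the output of file2vector (the data and sizes here are assumptions, not the asker's corpus):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the real X and y come from file2vector(corpus_dir).
X = ["doc %d" % i for i in range(100)]
y = ["I"] * 60 + ["Z"] * 40

labels = list(set(y))  # only the 2 unique classes, not one entry per sample

# Passing the unique labels reproduces the error from the question.
try:
    train_test_split(X, y, test_size=0.1, random_state=123, stratify=labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Passing the full y (same length as X) gives a stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=123, stratify=y
)
print(Counter(y_train), Counter(y_test))
```

With stratify=y, the class proportions of y are kept approximately equal in the train and test parts.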

Upvotes: 3
