sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples

Question

I have a multi-label classification problem. From what I found online, MultiLabelBinarizer is the recommended way to one-hot encode the labels.

I apply it to my labels (which I separate from the dataset itself) as follows:

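from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

ohe = MultiLabelBinarizer()
labels = ohe.fit_transform(labels)
train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2)  # 80% train split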

But it throws the following error:

Traceback (most recent call last): 
  File "learn.py", line 114, in <module>
    train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 293, in indexable
    check_consistent_length(*result)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [83292, 5]

--

EDIT: The labels dataset looks as follows (ignore the Interval column; it shouldn't be there and isn't actually counted, not sure why):

          Movement  Distance  Speed  Delay  Loss 
Interval
0                1         1     25      0     0
2                1         1     25      0     0
4                1         1     25      0     0
6                1         1     25      0     0
8                1         1     25      0     0
...            ...       ...    ...    ...   ...
260              3         5     50      0     0
262              3         5     50      0     0
264              3         5     50      0     0
266              3         5     50      0     0
268              3         5     50      0     0

From this we can see that it is a multi-label multi-class classification problem. The shape of the dataset and labels before and after the Binarizer are as follows:

             Before             After
dataset      (83292, 15)        (83292, 15)
labels       (83292, 5)         (5, 18)

Narendra Prasath · Accepted Answer

As you stated, the original shape of labels is (83292, 5), and after applying MultiLabelBinarizer it became (5, 18).

train_test_split(X, y) expects X and y to have the same number of rows: each of the 83292 data points in X needs its corresponding label row in y. In your case the shapes should therefore be (83292, 15) for X and (83292, k) for y, where k is the number of binarized label columns, not (5, 18).
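To make the row-count requirement concrete, here is a minimal sketch on toy arrays. The array sizes and the names X, y_bad and y_ok are made up for illustration; only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (hypothetical data): 20 samples with 15 features each.
X = np.zeros((20, 15))
y_bad = np.zeros((5, 18))   # 5 rows: does not match the 20 rows of X
y_ok = np.zeros((20, 18))   # 20 rows: one label row per sample

# Mismatched row counts reproduce the error from the question.
try:
    train_test_split(X, y_bad, test_size=0.2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Matching row counts split without complaint.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_ok, test_size=0.2, random_state=0
)
print(error_message)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

A quick check of X.shape[0] against y.shape[0] before splitting catches this kind of mismatch early.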

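Try this: the MultiLabelBinarizer output has the wrong dimensions. Iterating over a DataFrame yields its column names, so the binarizer saw 5 "samples" (the names Movement, Distance, Speed, Delay, Loss) and treated each of their 18 distinct characters as a label, which gives (5, 18). If your labels object is a DataFrame, call mlb.fit_transform(labels.values.tolist()) instead. This produces one output row per sample, 83292 here.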
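The difference between passing the DataFrame itself and passing its rows can be seen on a small example. This is only a sketch: the four-row frame below is invented, reusing the column names and a few values from the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical miniature of the labels frame from the question.
labels = pd.DataFrame({
    "Movement": [1, 1, 3, 3],
    "Distance": [1, 1, 5, 5],
    "Speed": [25, 25, 50, 50],
    "Delay": [0, 0, 0, 0],
    "Loss": [0, 0, 0, 0],
})

# Passing the DataFrame: iteration yields the 5 column names, and each
# name is split into characters, so the result has 5 rows.
wrong = MultiLabelBinarizer().fit_transform(labels)

# Passing the rows as a list of lists: one output row per sample.
fixed = MultiLabelBinarizer().fit_transform(labels.values.tolist())

print(wrong.shape)
print(fixed.shape)
```

Note that the binarizer treats each row as a set of values, so the same number appearing in two different columns (for example 1 in Movement and 1 in Distance) maps to a single label. If the columns need to stay distinct, that may call for a different encoding.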

Your labels should be in a format like the one below.

The y input can be a list of lists, or a DataFrame with a single column whose cells hold lists of values, so that its shape is (no_of_rows, 1); in that case pass the column itself to the binarizer, not the whole DataFrame. Make sure X and y have the same number of rows. A multi-label multi-class y can be represented like this:

[[1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0]]
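Putting it together, the list-of-lists above can be binarized and then split alongside the features. In this sketch the feature matrix X is a made-up placeholder with 15 columns, and random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# The ten label rows shown above.
y_rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5

# Placeholder features (hypothetical): one row of 15 values per label row.
X = np.arange(len(y_rows) * 15).reshape(len(y_rows), 15)

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y_rows)  # one binarized row per sample

train, test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(mlb.classes_)  # distinct label values found across all rows
print(train.shape, train_labels.shape, test.shape, test_labels.shape)
```

The number of columns in y equals the number of distinct values seen across all rows, so it will generally differ from the number of original label columns.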
