Reputation: 691
I have a multi-label classification problem. From what I found online, the recommended way to one-hot encode the labels is MultiLabelBinarizer.
I apply it to my labels (which I keep separate from the dataset itself) as follows:
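from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

ohe = MultiLabelBinarizer()
labels = ohe.fit_transform(labels)
train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2)  # 80% train split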
But it throws the following error:
Traceback (most recent call last):
  File "learn.py", line 114, in <module>
    train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 293, in indexable
    check_consistent_length(*result)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [83292, 5]
--
EDIT: The labels dataset looks as follows (ignore the Interval column; it shouldn't be there and isn't actually counted in the shape, not sure why):
          Movement  Distance  Speed  Delay  Loss
Interval
0                1         1     25      0     0
2                1         1     25      0     0
4                1         1     25      0     0
6                1         1     25      0     0
8                1         1     25      0     0
...            ...       ...    ...    ...   ...
260              3         5     50      0     0
262              3         5     50      0     0
264              3         5     50      0     0
266              3         5     50      0     0
268              3         5     50      0     0
From this we can see that it is a multi-label multi-class classification problem. The shapes of dataset and labels before and after the binarizer are as follows:
         Before       After
dataset  (83292, 15)  (83292, 15)
labels   (83292, 5)   (5, 18)
Upvotes: 4
Views: 41495
Reputation: 1531
As you stated, the original shape of labels is (83292, 5), and after applying MultiLabelBinarizer it became (5, 18).
train_test_split(X, y) expects X and y to have the same number of rows. For example, if X contains 83292 data points, y must contain the label for each of those 83292 data points.
Hence, in your case X and y should both have 83292 rows: shapes (83292, 15) and (83292, n_classes), where n_classes is the number of distinct label values the binarizer finds.
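The row-count requirement can be reproduced on a small scale. This is a minimal sketch with made-up zero-filled arrays (the names X, y_good and y_bad and the sizes are illustrative assumptions, not the asker's data); it only shows that the split succeeds when the row counts agree and raises a ValueError when they do not.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((10, 15))       # 10 samples, 15 features (illustrative sizes)
y_good = np.zeros((10, 18))  # one label row per sample
y_bad = np.zeros((5, 18))    # only 5 rows: does not line up with X

# Matching row counts: the split succeeds.
X_train, X_test, y_train, y_test = train_test_split(X, y_good, test_size=0.2)

# Mismatched row counts: scikit-learn refuses to split.
try:
    train_test_split(X, y_bad, test_size=0.2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(X_train.shape, y_train.shape)
print(error_message)
```

The captured message should be of the same kind as the one in the question, reporting the two differing sample counts.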
Try this:
Your MultiLabelBinarizer output has the wrong dimensions. Iterating over a DataFrame yields its column names rather than its rows, which is why you got 5 rows. So, if labels is a DataFrame, call mlb.fit_transform(labels.values.tolist()) instead. This produces one output row per input row, here 83292.
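To see where the (5, 18) shape likely comes from, here is a small sketch. The four-row DataFrame is invented for illustration and only reuses the column names and a few values from the question. The five column names happen to contain 18 distinct characters, which matches the shape the asker reported. Note also that MultiLabelBinarizer treats each row as a set of values, so the number of output columns equals the number of distinct values across all rows, not the number of input columns.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative stand-in for the asker's labels DataFrame (values are made up).
labels = pd.DataFrame(
    {
        "Movement": [1, 1, 3, 3],
        "Distance": [1, 1, 5, 5],
        "Speed": [25, 25, 50, 50],
        "Delay": [0, 0, 0, 0],
        "Loss": [0, 0, 0, 0],
    }
)

# Iterating over a DataFrame yields its column names, so the binarizer sees
# five "samples" whose "labels" are the characters of each name.
print(list(labels))
by_columns = MultiLabelBinarizer().fit_transform(labels)
print(by_columns.shape)

# Passing the rows as a list of lists gives one output row per input row.
mlb = MultiLabelBinarizer()
by_rows = mlb.fit_transform(labels.values.tolist())
print(by_rows.shape, mlb.classes_)
```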
Your labels should look like the example below. The y input can be a list of lists, or a DataFrame with a single column whose cells each hold a list of values, in which case its shape is (no_of_rows, 1). Make sure X and y have the same number of rows. A multi-label multi-class y can be represented like this:
[[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0]]
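Putting the pieces together, a list of lists in the format above can be binarized and then split alongside the features. This is a hedged sketch: X is a zero-filled placeholder with 15 columns, and random_state=0 is an arbitrary choice added only to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Ten label rows in the list-of-lists format shown above.
rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5

# Placeholder feature matrix with one row per label row (assumption).
X = np.zeros((len(rows), 15))

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(rows)  # one binarized row per input row

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X.shape, y.shape)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Because X and y now have the same number of rows, the split goes through and each training sample keeps its own label row.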
Upvotes: 4
Reputation: 36704
This means that the lengths of the arrays you're trying to split don't match. For X and y, sklearn takes the same indices, for example a random sample of 80% of the indices of your data, so the lengths have to match.
Imagine it's trying to keep these indices. What would sklearn do when there's nothing at some index?
0 1 0 0 1 0 1 0 0 1 0 1 0 1
a b b a b a b a a b b b
^ ^ ^ ^ ^ ^ ^ ^
Run this check to verify that the lengths match. Does it return True?
len(dataset) == len(labels)
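The idea of shared indices can be sketched in plain NumPy. This is not how scikit-learn is implemented internally, only an illustration under assumed toy arrays: the same randomly chosen row indices are applied to X and y, which only makes sense if every index exists in both.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.arange(14)          # 14 samples
y_ok = np.arange(14) * 10  # 14 labels, one per sample
y_short = np.arange(12)    # only 12 labels

# Pick a random 80% of the row indices, as a split would.
n_train = int(0.8 * len(X))
train_idx = rng.permutation(len(X))[:n_train]

# The same indices select matching rows from X and y_ok.
X_train = X[train_idx]
y_train = y_ok[train_idx]

def lengths_match(a, b):
    """Return True if both inputs have the same number of rows."""
    return len(a) == len(b)

# Indices that exist in X but have no counterpart in y_short.
missing = [i for i in range(len(X)) if i >= len(y_short)]

print(lengths_match(X, y_ok), lengths_match(X, y_short), missing)
```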
Upvotes: 1