Create a dataset for multi-labelled classification

Question

I have a dataset of the following form:

Id  Class

1   a
2   b
2   c
3   c
3   d
3   a
3   e
3   f
4   g

I need to prepare this data for multi-label classification, so I use:

df.groupby("Id").Class.apply(','.join).reset_index()

to get:

Id  Class

1   a
2   b,c
3   c,d,a,e,f
4   g

MultiLabelBinarizer cannot process this form, because each entry of df.Class is a single comma-joined string:

("a", "b,c", "c,d,a,e,f", "g")

whereas it expects one list of labels per row:

[["a"], ["b","c"], ["c","d","a","e","f"], ["g"]]

How should I go about it?

jezrael · Accepted Answer

You need to apply list instead of ','.join, so that each group becomes a list of labels:

print (df.groupby("Id").Class.apply(list))
Id
1                [a]
2             [b, c]
3    [c, d, a, e, f]
4                [g]
Name: Class, dtype: object
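
For completeness, here is a minimal sketch of how the grouped lists could then be passed to MultiLabelBinarizer. The DataFrame is rebuilt from the sample data in the question; the variable names (labels, encoded, result, joined, labels_from_joined) are only illustrative, and scikit-learn is assumed to be installed.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data from the question
df = pd.DataFrame({
    "Id": [1, 2, 2, 3, 3, 3, 3, 3, 4],
    "Class": ["a", "b", "c", "c", "d", "a", "e", "f", "g"],
})

# One list of labels per Id
labels = df.groupby("Id")["Class"].apply(list)

# Binarize: one row per Id, one 0/1 column per distinct class
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)
result = pd.DataFrame(encoded, index=labels.index, columns=mlb.classes_)

# If the labels were already joined with ',', they can be split back into lists
joined = df.groupby("Id")["Class"].apply(",".join)
labels_from_joined = joined.str.split(",")
```

Here result is indexed by Id with one indicator column per class, and labels_from_joined should hold the same lists as labels, so either route can feed the binarizer.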
