GIRISH kuniyal

Reputation: 770

How to handle a multi-label categorical feature for a binary classification problem?

I have a dataset like this:

   profile     category  target
0        1      [5, 10]       1
1        2          [1]       0
2        3   [23, 5000]       1
3        4  [700, 4500]       0

How should I handle the category feature? The table may have other, additional features too. One-hot encoding consumes too much space because the number of rows is around 10 million. Any suggestions would be helpful.

Upvotes: 0

Views: 567

Answers (2)

GIRISH kuniyal

Reputation: 770

MultiLabelBinarizer is a solution for this kind of problem. It can produce sparse output (with sparse_output=True), which is low in memory. You can convert the other features to a sparse matrix, then combine all the features and feed them into a machine learning model.
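
A minimal sketch of that idea on the sample data from the question. The column other_feature is hypothetical and only stands in for whatever additional features the real table has:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data from the question; "other_feature" is a made-up extra column.
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "other_feature": [0.5, 1.2, 3.4, 0.1],
    "target": [1, 0, 1, 0],
})

# One indicator column per distinct category, stored as a scipy sparse matrix.
mlb = MultiLabelBinarizer(sparse_output=True)
cat_sparse = mlb.fit_transform(df["category"])

# Convert the remaining features to sparse and stack everything column-wise.
other_sparse = sparse.csr_matrix(df[["other_feature"]].to_numpy(dtype=np.float64))
X = sparse.hstack([other_sparse, cat_sparse], format="csr")
y = df["target"].to_numpy()

print(mlb.classes_)  # the category that each indicator column stands for
print(X.shape)
```

X can then be passed to any estimator that accepts sparse input, so the full one-hot matrix never has to be materialized densely.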

source

Upvotes: 0

PV8

Reputation: 6260

My idea would be to split this array into new columns. This would lead to the following dataframe:

   profile    0     1  target
0        1    5    10       1
1        2    1             0
2        3   23  5000       1
3        4  700  4500       0
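
One way the split shown above could be done in pandas (a sketch, not necessarily the exact code the answer had in mind; the names split and wide are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# Each list becomes one row of a new frame; shorter lists are padded with NaN.
split = pd.DataFrame(df["category"].tolist(), index=df.index)

# Put the new columns (named 0 and 1) between profile and target.
wide = pd.concat([df[["profile"]], split, df[["target"]]], axis=1)
print(wide)
```

Note that the missing entry for profile 2 shows up as NaN, which also turns column 1 into a float column.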

In the next step, you can turn the categories into features (filled with 1 if the profile has that category). This leads to the following dataframe:

   profile  1  ...  5  ...  10  ...  23  target
0        1  0       1        1        0       1
1        2  1       0        0        0       0
2        3  0       0        0        1       1
3        4  0       0        0        0       0
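
The indicator table above can be sketched with explode and crosstab. This is one possible route, assuming every profile has at least one category (an empty list would explode to NaN and break the integer cast). The result is dense, so it only illustrates the layout; with 10 million rows a sparse representation is the safer choice:

```python
import pandas as pd

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One row per (profile, category) pair.
exploded = df[["profile", "category"]].explode("category")
exploded["category"] = exploded["category"].astype(int)

# Count pairs per profile, then cap at 1 so repeated categories stay binary.
indicators = pd.crosstab(exploded["profile"], exploded["category"]).clip(upper=1)

# Attach the indicator columns back to profile and target.
result = df[["profile", "target"]].join(indicators, on="profile")
print(result)
```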

Every category then becomes a feature, which can help you (the setup is similar to text classification problems). You can then apply a dimensionality reduction technique such as PCA.

With this approach you respect the categorical nature of the feature and can reduce the dimensionality later with mathematical techniques.
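
A sketch of the reduction step. Note the swap: instead of PCA it uses scikit-learn's TruncatedSVD, a related technique that skips the centering step and therefore works directly on a sparse indicator matrix without densifying it. The choice of n_components=2 is arbitrary, just for the toy data:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MultiLabelBinarizer

categories = [[5, 10], [1], [23, 5000], [700, 4500]]

# Sparse 0/1 indicator matrix: one column per distinct category.
X_sparse = MultiLabelBinarizer(sparse_output=True).fit_transform(categories)
X_sparse = X_sparse.astype(np.float64)

# Project the 7 indicator columns down to 2 components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)
print(X_reduced.shape)
```

On real data the number of components would be tuned, for example by looking at svd.explained_variance_ratio_.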

Upvotes: 1
