Reputation: 770
I have a dataset like:
   profile     category  target
0        1      [5, 10]       1
1        2          [1]       0
2        3   [23, 5000]       1
3        4  [700, 4500]       0
How should I handle the category feature? The table may have other additional features too. One-hot encoding would consume too much space because the number of rows is around 10 million. Any suggestion would be helpful.
Upvotes: 0
Views: 567
Reputation: 770
MultiLabelBinarizer is a solution for this kind of problem: it can give sparse output that is low in memory. You can convert the other features to a sparse matrix and then combine all the features to feed into the machine learning model.
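A minimal sketch of that idea, assuming the category column holds Python lists; scipy.sparse.hstack is used here to combine the sparse category indicators with the remaining features (just profile in this toy example):

import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# Sparse indicator matrix: one column per distinct category id
mlb = MultiLabelBinarizer(sparse_output=True)
X_cat = mlb.fit_transform(df["category"])          # scipy CSR matrix

# Convert the other features to sparse and stack them alongside
X_other = csr_matrix(df[["profile"]].values)
X = hstack([X_other, X_cat]).tocsr()

y = df["target"].values
print(X.shape, X.nnz)  # the feature matrix stays sparse / low in memory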
Upvotes: 0
Reputation: 6260
My idea would be to split this array into new columns:
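A minimal sketch of this step, assuming the category column holds Python lists (pandas pads the shorter lists with NaN):

import pandas as pd

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# Turn each list entry into its own column (named 0, 1, ...)
cat_cols = pd.DataFrame(df["category"].tolist(), index=df.index)
df = pd.concat([df[["profile"]], cat_cols, df[["target"]]], axis=1)
print(df)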
This would lead to the following dataframe:
   profile    0     1  target
0        1    5    10       1
1        2    1   NaN       0
2        3   23  5000       1
3        4  700  4500       0
In the next step you can adjust it so that the categories become features (filled with 1 if the profile has this category; see the sketch after the table below). This will lead to the following dataframe:
   profile  1  5  10  23  ...  target
0        1  0  1   1   0  ...       1
1        2  1  0   0   0  ...       0
2        3  0  0   0   1  ...       1
3        4  0  0   0   0  ...       0
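A minimal sketch of that adjustment, going straight from the original list-valued category column to one indicator column per category id (explode plus pivot_table is one way to do it; the column names are the category ids):

import pandas as pd

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One row per (profile, category), then pivot to indicator columns
wide = (
    df.explode("category")
      .assign(value=1)
      .pivot_table(index="profile", columns="category", values="value", fill_value=0)
      .reset_index()
      .merge(df[["profile", "target"]], on="profile")
)
print(wide)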
You will have every category as a feature, which can help you (it is then similar to a text classification problem). Then you can apply techniques for dimensionality reduction like PCA.
With this approach you are respecting the category behavior and can reduce the dimensionality later on with some mathematical techniques.
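Continuing from the wide dataframe in the sketch above, the indicator columns can be kept sparse and then reduced; TruncatedSVD is used here as a sparse-friendly variant of that idea (plain PCA would need a dense matrix first):

from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Keep the indicator block sparse; with ~10 million rows this saves a lot of memory
X = csr_matrix(wide.drop(columns=["profile", "target"]).to_numpy(dtype=float))

svd = TruncatedSVD(n_components=2, random_state=0)  # pick n_components as needed
X_reduced = svd.fit_transform(X)                    # dense (n_rows, n_components)
print(X_reduced.shape)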
Upvotes: 1