Felicia.H

Reputation: 361

pandas get_dummies on high-cardinality variables using one-hot encoding creates too many new features

I have several high-cardinality variables in a dataset and want to convert them into dummies. All of them have more than 500 levels. When I used pandas get_dummies, the resulting matrix was so large that my program crashed:

import pandas as pd

# Even with a sparse result, this still creates one dummy column per level
pd.get_dummies(data, sparse=True, drop_first=True, dummy_na=True)

I don't know a better way to handle high-cardinality variables besides one-hot encoding, but it increases the size of the data so much that it no longer fits in memory. Does anyone have a better solution?

Upvotes: 0

Views: 1393

Answers (1)

abrocod

Reputation: 31

  • Method 1: For non-linear algorithms like random forests (RF), you can replace a categorical variable with the number of times each level appears in the train set. This turns it into a single numeric feature (see the count-encoding sketch below).

  • Method 2: If you can make one-hot encoding fit into memory, consider applying one-hot encoding first and then a dimensionality reduction method (like PCA) or an embedding method (word2vec, etc.) to reduce the dimension before feeding the result into any ML algorithm (see the second sketch below).
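A minimal sketch of Method 1 (count encoding); the column name city and the toy data are made up for illustration:

import pandas as pd

train = pd.DataFrame({"city": ["NY", "NY", "LA", "SF", "LA", "NY"]})
test = pd.DataFrame({"city": ["LA", "SF", "Boston"]})

# Map each level to how often it appears in the train set
counts = train["city"].value_counts()
train["city_count"] = train["city"].map(counts)

# Levels unseen in train map to NaN; filling with 0 is one reasonable choice
test["city_count"] = test["city"].map(counts).fillna(0)

And a sketch of Method 2, assuming scikit-learn is available; TruncatedSVD is used here as the PCA stand-in because it accepts sparse input without densifying it:

import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for a high-cardinality column
data = pd.DataFrame({"cat": [f"level_{i % 500}" for i in range(2000)]})

# One-hot encode into a sparse representation to keep memory down
dummies = pd.get_dummies(data, sparse=True, drop_first=True)

# Convert to a scipy sparse matrix so the reduction step never densifies
X = dummies.astype(pd.SparseDtype("float", 0)).sparse.to_coo().tocsr()

# Reduce the ~500 dummy columns to 50 components
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)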

There is more discussion here: https://www.kaggle.com/general/16927

Upvotes: 2
