Reputation: 361
I have several high-cardinality variables in a dataset and want to convert them into dummies. Each of them has more than 500 levels. When I used pandas get_dummies, the resulting matrix became so large that my program crashed:
pd.get_dummies(data, sparse=True, drop_first=True, dummy_na=True)
I don't know of a better way to handle high-cardinality variables than one-hot encoding, but it increases the size of the data so much that it no longer fits in memory. Does anyone have better solutions?
Upvotes: 0
Views: 1393
Reputation: 31
Method 1: For non-linear algorithms like random forests, you can replace a categorical variable with the number of times each of its levels appears in the train set (count encoding). This turns it into a single numeric feature; see the sketch below.
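A minimal sketch of Method 1 in pandas (the column name "city" and the tiny train/test frames are made up for illustration):

import pandas as pd

# Hypothetical train/test split with one high-cardinality column, "city".
train = pd.DataFrame({"city": ["a", "b", "a", "c", "a", "b"]})
test = pd.DataFrame({"city": ["a", "c", "d"]})

# Count how often each level appears in the train set...
counts = train["city"].value_counts()

# ...and replace the category with that count: one numeric feature.
train["city_count"] = train["city"].map(counts)

# Levels unseen during training ("d") map to NaN; fill with 0.
test["city_count"] = test["city"].map(counts).fillna(0)

Note that the counts come from the train set only, and the same mapping is applied to the test set to avoid leakage.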
Method 2: If you can make the one-hot encoding fit into memory, you can apply it first and then use a dimensionality-reduction method (like PCA) or an embedding method (word2vec, etc.) to reduce the dimension before feeding the result into any ML algorithm.
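A sketch of Method 2 with scikit-learn, assuming the one-hot matrix fits in memory in sparse form. TruncatedSVD stands in for plain PCA here because it accepts sparse input directly, whereas PCA would require densifying the matrix, which is exactly what blows up memory. The column names and sizes are invented for illustration:

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real data: two columns with many levels.
data = pd.DataFrame({
    "cat1": [f"v{i % 500}" for i in range(2000)],
    "cat2": [f"w{i % 700}" for i in range(2000)],
})

# OneHotEncoder returns a scipy sparse matrix by default, so the full
# dummy matrix is never materialized densely.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(data)

# TruncatedSVD works directly on the sparse matrix.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_onehot)
print(X_reduced.shape)  # (2000, 50)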
There is more discussion here: https://www.kaggle.com/general/16927
Upvotes: 2