Jim GB

Reputation: 93

Categorical data transformation in Scikit-Learn

I have a 40 million x 22 numpy array of integer data for a classification task. Most of the features are categorical: different integer values represent different categories. For example, in the column "Color", 0 means blue, 1 means red, and so on. I have preprocessed the data using LabelEncoder.
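
For reference, here is a minimal sketch of the kind of encoding I did (the column and its values are made up, not my real data):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # Toy stand-in for one categorical column such as "Color".
    colors = np.array(["blue", "red", "red", "green", "blue"])

    le = LabelEncoder()
    encoded = le.fit_transform(colors)
    print(le.classes_)   # ['blue' 'green' 'red']
    print(encoded)       # [0 2 2 1 0]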

  1. Does it make sense to fit this data into any classification model in SK-learn? I tried to fit it into a Random Forest model but got extremely bad accuracy. I also tried One Hot Encoding to transform the data into dummy variables, but my computer can only handle the result as a sparse matrix, and the problem is that Random Forest only accepts a dense matrix, which would exceed my computer's memory.
  2. What's the correct strategy to deal with categorical data in SK-learn?

Upvotes: 2

Views: 1292

Answers (1)

Artem Sobolev

Reputation: 6069

LabelEncoder is useless in your case, since the output numbers do not make any sense as numbers (i.e., it's meaningless to perform arithmetic operations on them). OneHotEncoder is essential when dealing with categorical data.
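
For example, here is a minimal sketch of what OneHotEncoder does to integer-coded columns (toy data; note that the output is a sparse matrix by default):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Two integer-coded categorical columns, as in the question.
    X = np.array([[0, 1],
                  [1, 0],
                  [2, 1],
                  [0, 2]])

    enc = OneHotEncoder()            # output is sparse by default
    X_onehot = enc.fit_transform(X)

    print(X_onehot.shape)            # (4, 6): 3 + 3 dummy columns
    print(type(X_onehot))            # a scipy.sparse matrix
    print(X_onehot.toarray())        # dense view, only safe on tiny data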

Recently sklearn got support for sparse input in Random Forests and Decision Trees, so you might want to check out the latest version. Also, other methods like LogisticRegression support sparse data.
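
As a quick illustration (purely synthetic data, just to show that both kinds of estimators accept sparse input):

    import numpy as np
    from scipy import sparse
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)

    # Synthetic sparse "one-hot-like" matrix: 1000 samples, 50 columns.
    X_sparse = sparse.random(1000, 50, density=0.05, format="csr",
                             random_state=rng)
    y = rng.randint(0, 2, size=1000)

    # A sufficiently recent scikit-learn accepts sparse input directly:
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X_sparse, y)

    # Linear models have supported sparse input for a long time:
    lr = LogisticRegression().fit(X_sparse, y)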

Moreover, I don't think you need to use all 40M examples to get decent accuracy. It should be enough to randomly sample, say, 100k of them (the right number depends on the number of features after OneHotEncoding, their variability, and the number of target classes).
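
A sketch of that subsampling step (the arrays below are random stand-ins for your real 40M x 22 data):

    import numpy as np

    rng = np.random.RandomState(42)

    # Random stand-ins for the real data; in the question X is 40M x 22.
    X = rng.randint(0, 10, size=(1_000_000, 22))
    y = rng.randint(0, 2, size=1_000_000)

    # Draw 100k row indices without replacement, then slice both arrays.
    idx = rng.choice(X.shape[0], size=100_000, replace=False)
    X_sub, y_sub = X[idx], y[idx]
    print(X_sub.shape, y_sub.shape)  # (100000, 22) (100000,)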

Upvotes: 1
