Reputation: 93
I have a 40 million x 22 numpy array of integer data for a classification task. Most of the features are categorical: different integer values represent different categories. For example, in the column "Color", 0 means blue, 1 means red, and so on. I have preprocessed the data using LabelEncoder.
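For reference, a minimal sketch of the preprocessing described (the column values and names here are illustrative, not from my actual data):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw column of category strings; the real data has 22 columns.
color = np.array(["blue", "red", "blue", "green"])

# LabelEncoder assigns each distinct category an integer code
# (sorted order, so here: blue -> 0, green -> 1, red -> 2).
encoded = LabelEncoder().fit_transform(color)
print(encoded)  # [0 2 0 1]
```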
Upvotes: 2
Views: 1292
Reputation: 6069
LabelEncoder is useless in your case, since its output numbers do not make any sense as numbers (i.e. it is meaningless to perform arithmetic operations on them). OneHotEncoder is essential when dealing with categorical data.
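A minimal sketch of what that looks like, using a tiny stand-in for your integer-coded matrix (assuming every column is categorical):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the integer-coded matrix; the real one is 40M x 22.
X = np.array([[0, 2],
              [1, 0],
              [0, 1]])

# OneHotEncoder returns a scipy sparse matrix by default, which keeps
# the memory footprint manageable at 40M rows.
enc = OneHotEncoder()
X_ohe = enc.fit_transform(X)
print(X_ohe.shape, type(X_ohe))  # one column per (feature, category) pair
```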
scikit-learn recently gained support for sparse input in Random Forests and Decision Trees, so you might want to check out the latest version. Other estimators, such as LogisticRegression, also support sparse data.
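For example, both estimators below accept a scipy sparse matrix directly (the data and labels here are hypothetical placeholders for the one-hot-encoded output):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Tiny sparse stand-in for the one-hot-encoded feature matrix.
X_sparse = csr_matrix(np.array([[1, 0, 0, 1],
                                [0, 1, 1, 0],
                                [1, 0, 1, 0],
                                [0, 1, 0, 1]]))
y = np.array([0, 1, 0, 1])  # hypothetical binary labels

# Both fit on sparse input without densifying the full matrix.
LogisticRegression().fit(X_sparse, y)
RandomForestClassifier(n_estimators=10).fit(X_sparse, y)
```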
Moreover, I don't think you need all 40M examples to get decent accuracy. It should be enough to randomly sample, say, 100k of them (the right number depends on the number of features after one-hot encoding, their variability, and the number of target classes).
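A quick sketch of how you might subsample before encoding (the array sizes are placeholders for your real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in arrays; in the real setting X has 40M rows and 22 columns.
X = rng.integers(0, 5, size=(1_000_000, 22))
y = rng.integers(0, 2, size=1_000_000)

# Draw 100k row indices uniformly at random, without replacement.
idx = rng.choice(X.shape[0], size=100_000, replace=False)
X_sample, y_sample = X[idx], y[idx]
```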
Upvotes: 1