Reputation: 81
I want to use Random Forest for feature selection based on the Gini index. My dataset has a mix of numeric (continuous) and categorical (string) data. This is an example of the dataset:
Var1 Var2
198 zcROj17IEC
336 DHeTmBftjz
252.3 crIgUHSK8h
252 ZSNrjIX0Db
I know trees work on discrete (categorical) data, but does RandomForest in sklearn require continuous numeric data to be discretized first, or can it handle it directly? For the categorical string variables I used the following to encode the strings into numeric columns of zeros and ones:
pandas.get_dummies(X['Var2'])
This works, but for the numeric column I tried the following to discretize it:
pandas.qcut(X['Var1'], 2 , retbins=True)
but I keep getting an error about non-unique bin edges!
Do I need to discretize? How can I do it?
Upvotes: 1
Views: 2772
Reputation: 1009
Trees and forests tend to work worse when you make dummies from your categorical values.
You just need to label-encode your categorical features, that's all!
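A minimal sketch of what labelling could look like, assuming the four sample rows from the question and scikit-learn's `OrdinalEncoder` (one of several ways to do this; `X_encoded` is a name chosen here for illustration). Note that the integer codes impose an arbitrary order on the strings, which is the trade-off against dummy columns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample rows copied from the question.
X = pd.DataFrame({
    "Var1": [198, 336, 252.3, 252],
    "Var2": ["zcROj17IEC", "DHeTmBftjz", "crIgUHSK8h", "ZSNrjIX0Db"],
})

# Map each distinct string to one code (0, 1, 2, ...).
# The codes follow the sorted order of the strings, which carries no meaning.
encoder = OrdinalEncoder()
X_encoded = X.copy()
X_encoded["Var2"] = encoder.fit_transform(X[["Var2"]]).ravel()

print(X_encoded)
```

The encoded frame keeps a single `Var2` column instead of one column per distinct string, so it can be passed to the forest together with `Var1` as-is.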
Upvotes: 0
Reputation: 386
Random forest should support continuous variables no problem. See for example this sample.
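As a rough illustration of that point, here is a sketch that fits `RandomForestClassifier` on an undiscretized continuous column and reads the Gini-based importances. The data and the target `y` are synthetic assumptions made up for this example (the question does not show a target), and `noise` is a hypothetical uninformative feature added for contrast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: a continuous feature in the range of Var1,
# plus a pure-noise feature for comparison.
var1 = rng.uniform(150, 350, size=n)
noise = rng.normal(size=n)
X = np.column_stack([var1, noise])

# Hypothetical target that depends only on the continuous feature.
y = (var1 > 250).astype(int)

# No discretization: each tree picks its own split thresholds on the raw values.
forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
forest.fit(X, y)

# Mean decrease in Gini impurity per feature, normalized to sum to 1.
importances = forest.feature_importances_
print(dict(zip(["var1", "noise"], importances)))
```

Since the trees search for thresholds themselves, binning `Var1` with `qcut` beforehand is not needed for the forest to use it.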
Upvotes: 1