Sara

Reputation: 81

Discretizing continuous variables for RandomForest in Sklearn

I want to use Random Forest for feature selection based on the Gini index. My dataset has a mix of numeric (continuous) and categorical (string) data. This is an example of the dataset:

Var1   Var2
198    zcROj17IEC
336    DHeTmBftjz
252.3  crIgUHSK8h
252    ZSNrjIX0Db

I know trees work on discrete (categorical) data, but does RandomForest in Sklearn require continuous numeric data to be discretized first, or can it handle it as-is? For the categorical string variables I used the following to encode the strings into numeric columns of zeros and ones:

pandas.get_dummies(X['Var2'])

and it works, but for the numeric variable I tried the following to discretize:

pandas.qcut(X['Var1'], 2, retbins=True)

but I keep getting an error about non-unique bins!
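For context, this kind of error usually appears when a column has many repeated values, so that two or more quantile edges coincide. A minimal sketch of that situation follows; the `var1` values are made up for illustration (the four sample rows above would not trigger it), and `q=4` is used only to make the collapse visible. `duplicates='drop'` is an existing `pandas.qcut` option that merges the repeated edges instead of raising.

```python
import pandas as pd

# Hypothetical column: most values are identical, so several quantiles are equal.
var1 = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4], name="Var1")

# Default behaviour (duplicates='raise'): repeated bin edges cause a ValueError.
try:
    pd.qcut(var1, 4)
except ValueError as err:
    print("qcut failed:", err)

# duplicates='drop' removes the repeated edges, so fewer bins than requested come back.
binned, bins = pd.qcut(var1, 4, retbins=True, duplicates="drop")
print(bins)                   # the surviving bin edges
print(binned.value_counts())  # how many rows landed in each bin
```

Note that the number of bins returned can be smaller than the number requested, so any downstream code should not assume exactly `q` categories.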

Do I need to discretize? How can I do it?

Upvotes: 1

Views: 2772

Answers (2)

andrewchauzov

Reputation: 1009

Trees and forests work worse when you make dummies from your categorical values.

You just need to label-encode your categorical features, that's all!
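One possible reading of "label-encode" is to map each distinct string to an integer code. A minimal sketch using the sample rows from the question; the column name `Var2_code` is an assumption made for illustration, and the integer order is arbitrary (it follows the sorted category order), so the trees will treat it as an ordering that carries no real meaning.

```python
import pandas as pd

# Sample rows taken from the question.
X = pd.DataFrame({
    "Var1": [198, 336, 252.3, 252],
    "Var2": ["zcROj17IEC", "DHeTmBftjz", "crIgUHSK8h", "ZSNrjIX0Db"],
})

# Convert the string column to a pandas categorical and take its integer codes.
# "Var2_code" is a hypothetical column name, not something from the question.
X["Var2_code"] = X["Var2"].astype("category").cat.codes

# Keep only numeric columns for the model.
X_encoded = X.drop(columns=["Var2"])
print(X_encoded)
```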

Upvotes: 0

Bennet

Reputation: 386

Random forest should support continuous variables with no problem. See, for example, this sample.
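As a rough illustration of that point, the sketch below fits scikit-learn's `RandomForestClassifier` directly on a continuous column with no binning and reads the Gini-based importances from `feature_importances_`. The data, the `noise` column, and the target `y` are synthetic assumptions, since the question does not show a target variable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic data: "Var1" is continuous and left undiscretized; "noise" is unrelated to y.
X = pd.DataFrame({
    "Var1": rng.uniform(150, 350, size=n),
    "noise": rng.normal(size=n),
})
# Hypothetical target that depends only on Var1.
y = (X["Var1"] > 250).astype(int)

# criterion="gini" is the default; it is spelled out here to match the question.
forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The trees pick split thresholds on `Var1` themselves, which is why no prior discretization step is needed here.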

Upvotes: 1
