Random forest in sklearn

Question

I was trying to fit a random forest model using the random forest classifier package from sklearn. However, my data set consists of columns with string values ('country'). The random forest classifier here does not take string values. It needs numerical values for all the features. I thought of getting some dummy variables in place of such columns. But, I am confused as to how will the feature importance plot now look like. There will be variables like country_India, country_usa etc. How can get the consolidated importance of the country variable as I would get if I had done my analysis using R.

lejlot · Accepted Answer

You will have to do it by hand. There is no support in sklearn for mapping classifier specific methods through inverse transform of feature mappings. R is calculating importances based on multi-valued splits (as @Soren explained) - when using scikit-learn you are limtied to binary splits and you have to approximate actual importance. One of the simpliest solutions (although biased) is to store which features are actually binary encodings of your categorical variable and sum these resulting elements from feature importance vector. This will not be fully justified from mathematical perspective, but the simpliest thing to do to get some rough estimate. To do it correctly you should reimplement feature importance from scratch, and simply during calculation "for how many samples the feature is active during classification", you would have to use your mapping to correctly asses each sample only once to the actual feature (as adding dummy importances will count each dummy variable on the classification path, and you want to do min(1, #dummy on path) instead).

Random forest in sklearn

Answers (2)

Related Questions