w00dy

Reputation: 758

Random Forest feature importance: how many are actually used?

I use RF twice in a row.

First, I fit it using max_features='auto' on the whole dataset (109 features) in order to perform feature selection. The following is RandomForestClassifier.feature_importances_; it correctly gives me 109 scores, one per feature:

[0.00118087,  0.01268531,  0.0017589 ,  0.01614814,  0.01105567,
0.0146838 ,  0.0187875 ,  0.0190427 ,  0.01429976,  0.01311706,
0.01702717,  0.00901344,  0.01044047,  0.00932331,  0.01211333,
0.01271825,  0.0095337 ,  0.00985686,  0.00952823,  0.01165877,
0.00193286,  0.0012602 ,  0.00208145,  0.00203459,  0.00229907,
0.00242616,  0.00051358,  0.00071606,  0.00975515,  0.00171034,
0.01134927,  0.00687018,  0.00987706,  0.01507474,  0.01223525,
0.01170495,  0.00928417,  0.01083082,  0.01302036,  0.01002457,
0.00894818,  0.00833564,  0.00930602,  0.01100774,  0.00818604,
0.00675784,  0.00740617,  0.00185461,  0.00119627,  0.00159034,
0.00154336,  0.00478926,  0.00200773,  0.00063574,  0.00065675,
0.01104192,  0.00246746,  0.01663812,  0.01041134,  0.01401842,
0.02038318,  0.0202834 ,  0.01290935,  0.01476593,  0.0108275 ,
0.0118773 ,  0.01050919,  0.0111477 ,  0.00684507,  0.01170021,
0.01291888,  0.00963295,  0.01161876,  0.00756015,  0.00178329,
0.00065709,  0.        ,  0.00246064,  0.00217982,  0.00305187,
0.00061284,  0.00063431,  0.01963523,  0.00265208,  0.01543552,
0.0176546 ,  0.01443356,  0.01834896,  0.01385694,  0.01320648,
0.00966011,  0.0148321 ,  0.01574166,  0.0167107 ,  0.00791634,
0.01121442,  0.02171706,  0.01855552,  0.0257449 ,  0.02925843,
0.01789742,  0.        ,  0.        ,  0.00379275,  0.0024365 ,
0.00333905,  0.00238971,  0.00068355,  0.00075399]
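For reference, here is a minimal sketch of this first step (X and y stand in for my actual data; everything else is left at the defaults):

from sklearn.ensemble import RandomForestClassifier

# Fit on all 109 features; 'auto' means sqrt(n_features) are considered at each split
rf = RandomForestClassifier(max_features='auto')
rf.fit(X, y)

print(len(rf.feature_importances_))  # 109, one importance score per feature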

Then I transform the dataset using the previous fit, which should reduce its dimensionality, and re-fit RF on the result. Given max_features='auto' and 109 features (sqrt(109) ≈ 10), I would expect ~10 features in total; instead, calling rf.feature_importances_ returns more (62):

[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
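Roughly, the second step looks like this (a sketch; rf is the forest fitted above, and this estimator-level transform was deprecated in later scikit-learn releases):

# Reduce the dataset using the importances learnt by the first fit
X_reduced = rf.transform(X)
print(X_reduced.shape[1])  # 62 in my case, not the ~10 I expected

# Re-fit a fresh forest on the reduced data
rf2 = RandomForestClassifier(max_features='auto')
rf2.fit(X_reduced, y)
print(len(rf2.feature_importances_))  # 62, one score per surviving feature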

Why? Shouldn't it return just ~10 feature importances?

Upvotes: 0

Views: 811

Answers (1)

yangjie

Reputation: 6715

You misunderstood the meaning of max_features, which is

The number of features to consider when looking for the best split

It is not the number of features kept when transforming the data.

It is the threshold parameter of the transform method that determines which features are important enough to keep:

threshold : string, float or None, optional (default=None)

The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
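So to keep fewer features, pass an explicit threshold rather than relying on the default. A small sketch with SelectFromModel, which replaces the deprecated estimator-level transform in later scikit-learn versions (rf is your fitted forest, X your original data):

from sklearn.feature_selection import SelectFromModel

# Default behaviour: keep features whose importance >= the mean importance (62 in your case)
X_mean = SelectFromModel(rf, threshold='mean', prefit=True).transform(X)

# A scaled threshold is stricter and keeps fewer features
X_fewer = SelectFromModel(rf, threshold='1.25*mean', prefit=True).transform(X)

If you want exactly the top k features, rank feature_importances_ yourself instead, e.g. numpy.argsort(rf.feature_importances_)[::-1][:10].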

Upvotes: 4
