Reputation: 758
I use RF twice in a row.
First, I fit it with max_features='auto' on the whole dataset (109 features) in order to perform feature selection. The following is RandomForestClassifier.feature_importances_; it correctly gives me 109 scores, one per feature:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then I transform the dataset with the previous fit, which should reduce its dimensionality, and I re-fit RF on the result.
Given max_features='auto'
and the 109 features, I would expect to end up with only ~10 features (sqrt(109) ≈ 10). Instead, calling rf.feature_importances_
returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
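For reference, a minimal, self-contained sketch of the workflow (synthetic data and variable names are illustrative; SelectFromModel stands in for the deprecated rf.transform, and 'sqrt' is what 'auto' means for classifiers in recent scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for my data: 109 features
X, y = make_classification(n_samples=500, n_features=109, random_state=0)

# First fit on all 109 features
rf = RandomForestClassifier(max_features='sqrt', random_state=0)
rf.fit(X, y)
print(rf.feature_importances_.shape)  # (109,) -- one score per feature

# Feature selection based on the first fit
# (equivalent to the old rf.transform(X))
selector = SelectFromModel(rf, prefit=True)  # default threshold
X_reduced = selector.transform(X)

# Re-fit on the reduced dataset
rf2 = RandomForestClassifier(max_features='sqrt', random_state=0)
rf2.fit(X_reduced, y)
print(rf2.feature_importances_.shape)  # far more than 10 features survive
```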
Upvotes: 0
Views: 811
Reputation: 6715
You misunderstood the meaning of max_features
, which is:
The number of features to consider when looking for the best split
It is not the number of features kept when transforming the data. It is the threshold
parameter of the transform
method that determines which features are kept as the most important ones:
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
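A minimal sketch of how threshold (not max_features) controls the number of surviving features; the setup below is a synthetic stand-in for the question's, and SelectFromModel is the current home of the old estimator-transform logic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Stand-in for the question's setup: a forest fitted on 109 features
X, y = make_classification(n_samples=500, n_features=109, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)
imp = rf.feature_importances_

# The default threshold is the *mean* importance. Importances sum to 1,
# so the mean is 1/109; every feature scoring above that is kept -- with
# the question's importances, that leaves 62 features, not ~10.
print((imp >= imp.mean()).sum())

# Raise the threshold to keep fewer features...
sel = SelectFromModel(rf, prefit=True, threshold='2*mean')

# ...or request an exact count (SelectFromModel's own max_features
# parameter, unrelated to the forest's max_features):
sel = SelectFromModel(rf, prefit=True, threshold=-np.inf, max_features=10)
print(sel.transform(X).shape[1])  # 10
```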
Upvotes: 4