Reputation: 147
I am trying to figure it out which features are being considered in each subsampling in my classification problem, for this, I am assuming that there is a random subset of features of length max_features
that is considered when building every tree.
I am interested in this because I am using two different types of features for my problem so I want to make sure that in every tree both types of features are being used for every node split. So one way to at least make each tree to consider all features is by setting the max_features
parameter to None
. So one question here would be:
Does that mean that both types of features are being considered for every node split?
Another one derived from the previous question is:
Since Random Forest make a subsampling for every tree, is this subsampling among cases (rows) or among columns (features) as well? Besides, can this subsampling be done by group of rows instead of randomly?
Besides, it does not seem to be a good assumption to use all the features in the max_features
parameter neither on Decision Trees
nor on random forest
since it is opposite to the whole point and definition of random forest
in terms of correlation among trees (I am not completely sure about this statement).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Any suggestion or comment is very welcomed.
Feel free to correct any assumption.
In the source code I have been reading about this but could not find where this might be defined.
Source code inspected so far:
splitter.py code from decision tree
forest.py code from random forest
Upvotes: 0
Views: 731
Reputation: 60318
Does that mean that both types of features are being considered for every node split?
Given that you have correctly pointed out above that setting max_features
to None
will indeed force the algorithm to consider all features in every split, it is unclear what exactly you are asking here: all means all and, from the point of view of the algorithm, there are not different "types" of features.
Since Random Forest make a subsampling for every tree, is this subsampling among cases (rows) or among columns (features) as well?
Both. But, regarding the rows, it is not exactly subsampling, it is actually bootstrap sampling, i.e. sampling with replacement, which means that, in each sample, some rows will be missing while others will be present multiple times.
Random forest is in fact the combination of two independent ideas: bagging, and random selection of features. The latter corresponds essentially to "column subsampling", while the former includes the bootstrap sampling I have just described.
Besides, can this subsampling be done by group of rows instead of randomly?
AFAIK no, at least in the standard implementations (including scikit-learn).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Everything can be modified in the source code, literally; now, if it is really necessary (or even a good idea) to do so is a different story...
Besides, it does not seem to be a good assumption to use all the features in the
max_features
parameter
It does not indeed, as this is the very characteristic that discriminates RF from the simpler bagging approach (short for bootstrap aggregating). Experiments have indeed shown that adding this random selection of features at each step boosts performance related to simple bagging.
Although your question (and issue) sounds rather vague, my advice would be to "sit back and relax", letting the (powerful enough) RF algorithm do its job with your data...
Upvotes: 2