srforests

Reputation: 93

Estimating class probabilities with hierarchical random forest models

I am using a Random Forest classifier (in R) to predict the spatial distribution of multiple native plant communities using a variety of environmental variables as predictors. This classification system is hierarchical, with each successive level becoming more detailed in its class descriptions. For example, suppose I have a hierarchical classification system with 2 levels, where the uppermost level consists of two classes: Forest (F) and Grassland (G). Let's say that at the second level, each Forest and Grassland class is composed of 2 subclasses (F1, F2 and G1, G2). Using the Forest class as an example, the subclasses might be Conifer and Deciduous Forests.

I know this is pretty basic so far, but here's the challenge I've run into. I'd like to predict the spatial distribution of these classes at the finest classification level, but there is too much environmental variation to do this with acceptable accuracy. To reduce this variability I can train multiple Random Forest models: the first model (model #1) operates at the uppermost level, classifying observations into either F or G. At the second level, I subset the data into two groups based on their F/G class and train two models (models #2 and #3), each classifying a subset into its respective subclasses.

Using these stacked models, I predict the class probability of a new observation. Using Random Forests, this value is the number of trees voting for a particular class divided by the number of trees in the forest. For a single new observation a summarized Random Forest output might be:

Level 1 (Model #1)
- F, G = 80, 20

Level 2 (Models #2 and #3)
- F1, F2 = 80, 20
- G1, G2 = 70, 30

The output suggests this new observation is most likely a Forest with a subclass of F1, but how confident am I that F1 is the correct class?

My questions are: first, is there an appropriate method for calculating the combined probability that this new observation is actually F1, given this modeling structure? Second, if so, how? (I suspect some sort of Bayesian approach using upper-level probabilities as priors might work, but I'm far from proficient in Bayesian statistics.)
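To make the kind of combination I have in mind concrete, here is a small sketch (in Python rather than my actual R workflow, and with the toy vote fractions from above) that treats the level-1 vote fractions as class probabilities and the level-2 vote fractions as conditional probabilities, then multiplies down the hierarchy:

```python
# Treat level-1 vote fractions as P(class) and level-2 vote fractions as
# conditional probabilities P(subclass | class), then apply the chain rule.
# Numbers are the example Random Forest outputs from above.
p_level1 = {"F": 0.80, "G": 0.20}                 # model #1
p_level2 = {"F": {"F1": 0.80, "F2": 0.20},        # model #2
            "G": {"G1": 0.70, "G2": 0.30}}        # model #3

# Joint probability of each terminal subclass:
# P(F1) = P(F) * P(F1 | F), and so on.
p_joint = {sub: p_level1[cls] * p
           for cls, subs in p_level2.items()
           for sub, p in subs.items()}

# F1 comes out highest: 0.8 * 0.8 = 0.64, and the four values sum to 1.
print(p_joint)
```

Whether interpreting the vote fractions this way is statistically justified is exactly what I'm unsure about.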

I apologize for my verbosity and for not posting actual data/code (it's hard to extract something both succinct and representative of my issue given my dataset). Thanks!

Upvotes: 1

Views: 1175

Answers (1)

Yoni Gavish

Reputation: 36

I'm actually working on a similar issue and have written an R package that runs randomForest as the local classifier along a pre-defined class hierarchy. You can find it on R-Forge under 'hie-ran-forest'. The package includes two ways to turn the local probabilities into a crisp class:

  1. Stepwise majority rule: choose the class with the highest proportion of votes in your level 1 model, then choose the class with the highest proportion of votes in the corresponding second-level model.
  2. Multiplicative majority rule: multiply the probabilities (proportions of votes) down the class hierarchy and choose the terminal class with the highest product.

In the example you provided, both methods end up with F1, but for these values:

F, G   = 0.6,  0.4
F1, F2 = 0.6,  0.4 
G1, G2 = 0.95, 0.05

the stepwise majority rule will choose F1 (F in model 1 and F1 in model 2), while the multiplicative rule will choose G1, since

0.4*0.95 (G1) > 0.6*0.6 (F1) > 0.6*0.4 (F2) > 0.4*0.05 (G2)
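The two rules can be sketched in a few lines (an illustration of the arithmetic, not the hie-ran-forest code itself), using the vote fractions just above:

```python
# Vote fractions for the second example in this answer.
level1 = {"F": 0.6, "G": 0.4}
level2 = {"F": {"F1": 0.6, "F2": 0.4},
          "G": {"G1": 0.95, "G2": 0.05}}

# 1. Stepwise majority rule: pick the level-1 winner, then pick the winner
#    among that class's subclasses only.
top = max(level1, key=level1.get)                  # -> "F"
stepwise = max(level2[top], key=level2[top].get)   # -> "F1"

# 2. Multiplicative majority rule: multiply vote fractions down the
#    hierarchy and pick the terminal class with the largest product.
products = {sub: level1[cls] * p
            for cls, subs in level2.items()
            for sub, p in subs.items()}
multiplicative = max(products, key=products.get)   # -> "G1"

print(stepwise, multiplicative)  # F1 G1
```

Here the rules disagree because G1 dominates its sibling so strongly (0.95 vs 0.05) that its product 0.4 * 0.95 = 0.38 beats F1's 0.6 * 0.6 = 0.36.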

I don't think there is a 'correct' option, and in general I find that the two methods usually reach very similar accuracy levels. The stepwise rule is more sensitive to misclassification near the root of the tree, yet if your model 1 is correct, it will tend to make less 'serious' misclassifications. On the other hand, the multiplicative rule is less sensitive to the results of any specific local classifier, but it is sensitive to the depth of the class hierarchy and to the number of siblings in each local classifier.

Upvotes: 2
