yguw

Reputation: 844

How is rawPrediction calculated in PySpark RandomForest implementation?

I have trained an RF model (with 3 trees and a depth of 4) on a training set of 15 examples. Below is an image of how the three trees look. I have two classes (say 0 and 1).

[Image: the three decision trees]

The thresholds are shown on the left branches, and the numbers in the circles (for example, 7 and 3) are the number of examples that were <= the threshold and > the threshold for feature 2 (i.e. f2).

Now when I try to apply the model on the test set of 10 examples, I am not sure how the raw prediction is calculated.

+----------+--------+-------------+-----------+
|prediction|features|rawPrediction|probability|
+----------+--------+-------------+-----------+
|1.0       |[0.07707524933080619,0.03383458646616541,0.017208413001912046,9.0,2.5768015000258258,0.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0014143059186559938,0.0,0.6666666666666667,7.076533785087878E-4,0.0014163090128755495,0.9354143466934853,0.9333333333333333,0.875,0.938888892531395,7.0]|[1.0,2.0]|[0.3333333333333333,0.6666666666666666]|
+----------+--------+-------------+-----------+

I have gone through the following links to understand this, but I am not able to get my head around it.

https://forums.databricks.com/questions/14355/how-does-randomforestclassifier-compute-the-rawpre.html

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala

I know for sure that it's not as straightforward as a simple majority vote. For example, from my understanding it is not the case that if two trees predict 0 and one tree predicts 1, the raw prediction would be [2, 1]. When I train the model on 500 examples, I see the raw prediction for the same example to be [0.9552544653780279,2.0447455346219723].

Can someone explain how this is calculated mathematically? Any help would be appreciated, as it's fairly basic and I want to get my head straight on how this works. Thanks a lot in advance, and please let me know if any other information is required to help resolve this.

Edit: Adding the data from the model as well:

+------+-----------------------------------------------------------------------------------------------------+
|treeID|nodeData                                                                                             |
+------+-----------------------------------------------------------------------------------------------------+
|0     |[0, 0.0, 0.5, [9.0, 9.0], 0.19230769230769235, 1, 4, [2, [0.12519961673586713], -1]]                 |
|0     |[1, 0.0, 0.42603550295857984, [9.0, 4.0], 0.42603550295857984, 2, 3, [20, [0.39610389610389607], -1]]|
|0     |[2, 0.0, 0.0, [9.0, 0.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|0     |[3, 1.0, 0.0, [0.0, 4.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|0     |[4, 1.0, 0.0, [0.0, 5.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|1     |[0, 1.0, 0.4444444444444444, [5.0, 10.0], 0.4444444444444444, 1, 2, [4, [0.9789660448762616], -1]]   |
|1     |[1, 1.0, 0.0, [0.0, 10.0], -1.0, -1, -1, [-1, [], -1]]                                               |
|1     |[2, 0.0, 0.0, [5.0, 0.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|2     |[0, 0.0, 0.48, [3.0, 2.0], 0.48, 1, 2, [20, [0.3246753246753247], -1]]                               |
|2     |[1, 0.0, 0.0, [3.0, 0.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|2     |[2, 1.0, 0.0, [0.0, 2.0], -1.0, -1, -1, [-1, [], -1]]                                                |
+------+-----------------------------------------------------------------------------------------------------+

Upvotes: 0

Views: 1001

Answers (1)

Shaido

Reputation: 28322

The raw prediction is the predicted class probabilities for each tree, summed over all trees in the forest. For the class probabilities of a single tree, what matters is the number of training samples belonging to each class in the selected leaf node.

In the code, this procedure can be seen in the RandomForestClassifier class; the relevant part is quoted here:

override protected def predictRaw(features: Vector): Vector = {
  // TODO: When we add a generic Bagging class, handle transform there: SPARK-7128
  // Classifies using majority votes.
  // Ignore the tree weights since all are 1.0 for now.
  val votes = Array.fill[Double](numClasses)(0.0)
  _trees.view.foreach { tree =>
    val classCounts: Array[Double] = tree.rootNode.predictImpl(features).impurityStats.stats
    val total = classCounts.sum
    if (total != 0) {
      var i = 0
      while (i < numClasses) {
        votes(i) += classCounts(i) / total
        i += 1
      }
    }
  }
  Vectors.dense(votes)
}

For each tree, we find the leaf node corresponding to the input features and look up the count for each class (the class count is the number of training samples of that class assigned to the leaf node during training). The class counts are divided by the total count of the node to get the class probabilities.

Now, for each tree, we have the probability of the input features belonging to each class. These probabilities are summed to get the raw predictions.
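The per-tree normalization and the final sum can be sketched in plain Python. The leaf class counts below are one assignment consistent with the nodeData table and the [1.0, 2.0] output in the question (all leaves in that table are pure, so each tree contributes a one-hot vector); which leaf a row actually reaches depends on its feature values, so treat the assignment as illustrative:

```python
# Leaf class counts [count_class0, count_class1] reached by the example row
# in each of the three trees. One assignment consistent with the question's
# nodeData; which leaf is actually hit depends on the row's feature values.
leaf_counts = [
    [0.0, 4.0],   # tree 0: a pure class-1 leaf
    [0.0, 10.0],  # tree 1: a pure class-1 leaf
    [3.0, 0.0],   # tree 2: a pure class-0 leaf
]

num_classes = 2
votes = [0.0] * num_classes
for counts in leaf_counts:              # mirrors predictRaw above
    total = sum(counts)
    if total != 0:
        for i in range(num_classes):
            votes[i] += counts[i] / total   # per-tree class probability

raw_prediction = votes                            # [1.0, 2.0]
probability = [v / sum(votes) for v in votes]     # [1/3, 2/3]
print(raw_prediction, probability)
```

Dividing the summed votes by the number of trees (here, by 3) gives the probability column, which is why [1.0, 2.0] corresponds to [0.333..., 0.666...].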


More specifically for this question, from my understanding the main missing element is the class counts (and, from those, the class probabilities); these are essential for calculating the raw prediction. In the image, instead of just "Pred 0" and "Pred 1", you need the number of training samples assigned to each leaf during training ("Pred 0" means the samples from class 0 are the majority in that leaf, and vice versa). Once you know the class counts and class probabilities for each tree, sum the probabilities over all trees and you get the raw prediction.
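This also explains the non-integer raw prediction [0.9552..., 2.0447...] seen with 500 training examples: with more data the leaves are generally no longer pure, so each tree contributes fractional probabilities instead of a one-hot vote. A hypothetical illustration (these leaf counts are made up, not taken from the actual 500-example model):

```python
# Hypothetical impure leaf counts from a forest trained on more data.
leaf_counts = [
    [287.0, 13.0],   # tree 0: mostly class 0, but not pure
    [5.0, 295.0],    # tree 1: mostly class 1
    [2.0, 298.0],    # tree 2: mostly class 1
]

votes = [0.0, 0.0]
for counts in leaf_counts:
    total = sum(counts)
    for i in range(2):
        votes[i] += counts[i] / total   # fractional contribution per tree

print(votes)  # approximately [0.98, 2.02]: non-integer, like [0.955..., 2.044...]
```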

Upvotes: 2

Related Questions