moufkir

Reputation: 81

Random Forest Classifier: which class does each probability correspond to?

I am using the RandomForestClassifier from pyspark.ml.classification.

I ran the model on a binary-class dataset and displayed the probabilities.

The probability column contains the following:

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|0.0  |0.0       |[0.9005918461098429,0.0994081538901571]|
|1.0  |1.0       |[0.6051335859900139,0.3948664140099861]|
+-----+----------+---------------------------------------+

Each row holds a list of 2 elements, which obviously correspond to the class probabilities.

My question: does probability[0] always correspond to the value of prediction? The Spark documentation is not clear on this.

Upvotes: 3

Views: 1071

Answers (2)

Anneso

Reputation: 613

I posted almost the same question here, and I think the answer might help you: Scala: how to know which probability correspond to which class?

The answer lies in what happens before the model is fitted.

To fit the model, we use a label indexer on the target. This label indexer transforms the target into an index ordered by descending frequency.

Example: if my target contains 20% "aa" and 80% "bb", the label indexer creates a column "label" that takes the value 0 for "bb" and 1 for "aa" (because "bb" is more frequent than "aa").

When we fit a random forest, the probabilities follow this frequency order.

In binary classification:

  • first probability = probability that the observation belongs to the most frequent class in the training set
  • second probability = probability that the observation belongs to the less frequent class in the training set
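The frequency-based indexing described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the helper name frequency_desc_index, the sample target list, and the alphabetical tie-break are assumptions made for the example. In Spark itself, the fitted StringIndexerModel exposes the same ordering through its labels attribute.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically here
    # (an assumption of this sketch, chosen to keep the result deterministic).
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Hypothetical target column: 80% "bb", 20% "aa".
target = ["bb"] * 8 + ["aa"] * 2
mapping = frequency_desc_index(target)
print(mapping)  # {'bb': 0.0, 'aa': 1.0}

# Position i of a probability vector refers to the label with index i.
index_to_label = {int(idx): lab for lab, idx in mapping.items()}
probability = [0.9005918461098429, 0.0994081538901571]  # first row of the question's table
for i, p in enumerate(probability):
    print(index_to_label[i], p)
```

With this mapping, the first entry of the vector is read as the probability of "bb" and the second as the probability of "aa".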

Upvotes: 0

Savage Henry

Reputation: 2069

I am interpreting your question as asking: does the first element in the array under the column 'probability' always correspond to the "predicted class", by which you mean the label the Random Forest Classifier predicted the observation should have?

If I have that correct, the answer is Yes.

The items in the arrays in both probability rows can be read as the model telling you:

['My confidence that the predicted label = the true label', 'My confidence that the label != the true label']

When more than two labels are being predicted, the model is telling you:

['My confidence that the label I predict = specific label 1', 'My confidence that the label I predict = specific label 2', ..., 'My confidence that the label I predict = specific label N']

This is indexed by the N labels you are trying to predict (which means you have to be careful about the way the labels are structured).
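As a rough sketch of what "indexed by the N labels" means in practice: position i of the vector belongs to label index i, and, assuming no custom thresholds are set on the classifier, the prediction is the index holding the largest probability. The helper predicted_index and the labels list are hypothetical names used only for this illustration.

```python
def predicted_index(probability):
    """Return the index of the largest entry (the first one wins on ties)."""
    return max(range(len(probability)), key=lambda i: probability[i])

# Hypothetical label order: index 0 is the most frequent class.
labels = ["bb", "aa"]

# First row of the question's table.
probability = [0.9005918461098429, 0.0994081538901571]
idx = predicted_index(probability)
print(float(idx), labels[idx], probability[idx])

# The same rule with three classes: entry i is the confidence for label index i.
multi = [0.2, 0.5, 0.3]
print(predicted_index(multi))
```

Reading the vector this way ties each entry to a fixed label index rather than to whichever class happened to be predicted for that row.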

Perhaps it would help to take a look at this answer. You could do something like:

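    # Fit the pipeline on the training set, then score the test set
    model = pipeline.fit(training_data)
    predictions = model.transform(test_data)
    # show() prints the first 10 rows itself, so no print is needed
    predictions.show(10)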

(Using the relevant pipeline and data from your examples.)

This will show you the probabilities for each class.

Upvotes: 1
