predict_proba(X) of RandomForestClassifier (sklearn) seems to be static?

Question

For a given sample, I want to retrieve the prediction score/probability for every class. I'm using sklearn's RandomForestClassifier. My code runs fine with .predict(). However, to show the probabilities I'm using .predict_proba(X), and it always returns the same values, even when X changes. Why is that, and how can I fix it?

Here is my code, reduced to the relevant parts:

# ... code ... feature generation / gets the feature data
if rf is None:
    rf = RandomForestClassifier(n_estimators=80)
    rf.fit(featureData, classes)
else:
    prediction = rf.predict(featureData)  # returns the right class, different each time
    proba = rf.predict_proba(featureData)
    print(proba)  # always prints the same values for all my 40 classes

Interestingly, max(proba) retrieves the class that .predict() returned in the very first run. Since .predict() works as expected, I believe the error is on sklearn's side, i.e. I guess there is a flag that needs to be set.

Does anyone have an idea?

Tonechas · Accepted Answer

I guess the problem is that you are always passing the same argument to predict_proba. Here is my code to build a forest of trees from the iris dataset:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
rf = RandomForestClassifier(n_estimators=80)
rf.fit(X, y)

When I call the methods predict and predict_proba, the class and class probability predictions differ for different arguments, as one would reasonably expect.

Sample run:

In [82]: a, b = X[:3], X[-3:]

In [83]: a
Out[83]: 
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])

In [84]: b
Out[84]: 
array([[ 6.5,  3. ,  5.2,  2. ],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.9,  3. ,  5.1,  1.8]])

In [85]: rf.predict(a)
Out[85]: array([0, 0, 0])

In [86]: rf.predict(b)
Out[86]: array([2, 2, 2])

In [87]: rf.predict_proba(a)
Out[87]: 
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])

In [88]: rf.predict_proba(b)
Out[88]: 
array([[ 0.    ,  0.    ,  1.    ],
       [ 0.    ,  0.0125,  0.9875],
       [ 0.    ,  0.0375,  0.9625]])
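
To check this in your own code, a minimal sketch along the following lines may help. It reuses the iris setup from above and compares both the inputs and the returned probabilities, then confirms that the column with the highest probability maps back to the label from predict via rf.classes_. The flag names (inputs_differ, probas_differ, matches_predict) and the fixed random_state are my own additions for illustration, not part of the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state is fixed only to make this sketch reproducible (an assumption,
# the original code does not set it)
rf = RandomForestClassifier(n_estimators=80, random_state=0)
rf.fit(X, y)

a, b = X[:3], X[-3:]
proba_a = rf.predict_proba(a)
proba_b = rf.predict_proba(b)

# If the probabilities never change, first check that the inputs really do
inputs_differ = not np.array_equal(a, b)
probas_differ = not np.array_equal(proba_a, proba_b)

# Each row holds one probability per class, in the order given by rf.classes_
labels_from_proba = rf.classes_[np.argmax(proba_b, axis=1)]
matches_predict = np.array_equal(labels_from_proba, rf.predict(b))

print(inputs_differ, probas_differ, matches_predict)
```

On the iris data all three flags should come out True. If inputs_differ is False in your own loop, the same featureData is being passed on every call, which would explain the constant output of predict_proba.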
