EmJ

Reputation: 4618

How to get the prediction probabilities using cross validation in scikit-learn

I am using RandomForestClassifier with cross-validation for a binary classification problem (class labels are 0 and 1), as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print("Accuracy: " + str(round(100*accuracy.mean(), 2)) + "%")
f1 = cross_val_score(clf, X, y, cv=k_fold, scoring = 'f1_weighted')
print("F Measure: " + str(round(100*f1.mean(), 2)) + "%")

Now I want to order my data by the cross-validated prediction probabilities of class 1. For that I tried the following two approaches.

pred = clf.predict_proba(X)[:,1]
print(pred)

probs = clf.predict_proba(X)
best_n = np.argsort(probs, axis=1)[:,-6:]

Both attempts raise the following error:

NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

I am wondering what I am doing wrong.

I am happy to provide more details if needed.

Upvotes: 2

Views: 2175

Answers (3)

Venkatachalam

Reputation: 16966

If you want to use a model fitted during cross-validation on one or more unseen data points, use the following approach.

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(n_estimators=10, random_state = 42, class_weight="balanced")

cv_results = cross_validate(clf, X, y, cv=3, return_estimator=True)

clf_fold_0 = cv_results['estimator'][0]

clf_fold_0.predict_proba([iris.data[133]])

# array([[0. , 0.5, 0.5]])
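
The snippet above uses only the model from the first fold. As a minimal sketch, assuming you would rather look at all fold models, the same point can be passed to every estimator returned by cross_validate. The names sample, per_fold and mean_proba are made up for this illustration, and averaging across folds is just one possible choice, not something the approach above prescribes.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
cv_results = cross_validate(clf, X, y, cv=3, return_estimator=True)

# Hypothetical "unseen" point, reused from the data purely for illustration.
sample = X[133].reshape(1, -1)

# One probability row per fold model, stacked into shape (n_folds, n_classes).
per_fold = np.vstack([est.predict_proba(sample) for est in cv_results["estimator"]])

# Averaging across folds is an illustrative choice (an assumption of this sketch).
mean_proba = per_fold.mean(axis=0)

print(per_fold)
print(mean_proba)
```

Note that point 133 was part of the training data for two of the three folds, so it is not truly unseen for those models.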

Upvotes: 2

EmJ

Reputation: 4618

I solved my problem using the following code:

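import numpy as np
from sklearn.model_selection import cross_val_predict
# Out-of-fold class probabilities: each row is predicted by a model that did not train on it
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
print(proba[:,1])
print(np.argsort(proba[:,1]))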
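
For anyone who wants to run this end to end, here is a self-contained sketch. It assumes synthetic data from make_classification as a stand-in for X and y, and the names p1 and order are only for illustration. Since np.argsort sorts in ascending order, the result is reversed to list the samples most likely to be class 1 first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary data, an assumption standing in for the real X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each row is predicted by a model that did not see that row during training.
proba = cross_val_predict(clf, X, y, cv=k_fold, method="predict_proba")
p1 = proba[:, 1]

# Sample indices ordered from most to least likely to be class 1.
order = np.argsort(p1)[::-1]

print(order[:10])
print(p1[order[:10]])
```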

Upvotes: 1

skillsmuggler

Reputation: 1902

Have a look at the documentation: it specifies that the probability is calculated as the mean of the results from the trees.

In your case, you first need to call the fit() method to build the trees in the model. Once the model is fitted on the training data, you can call the predict_proba() method.

This is also stated in the error message.

# Fit model
model = RandomForestClassifier(...)
model.fit(X_train, Y_train)

# Probability of class 1
model.predict_proba(X)[:,1]
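
A runnable sketch of the same idea, assuming synthetic data and a train/test split (neither is from the question). It also shows why the error appears even after calling cross_val_score: that function fits clones of the estimator, so the original object stays unfitted. The variable names raised and proba_class_1 are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary data, an assumption standing in for the real X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# cross_val_score fits clones of the estimator, so `model` itself stays unfitted.
cross_val_score(model, X_train, y_train, cv=5)
try:
    model.predict_proba(X_test)
    raised = False
except NotFittedError:
    raised = True
print("NotFittedError raised:", raised)

# After an explicit fit, predict_proba works on the same object.
model.fit(X_train, y_train)
proba_class_1 = model.predict_proba(X_test)[:, 1]
print(proba_class_1[:5])
```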

Upvotes: 1
