Reputation: 423
I use the code below to run cross-validation and return ROC AUC scores.
rf = RandomForestClassifier(n_estimators=1000, oob_score=True, class_weight='balanced')
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring='roc_auc')
How can I return the ROC AUC based on
roc_auc_score(y_test, results.predict(X_test))
rather than
roc_auc_score(y_test, results.predict_proba(X_test))?
Upvotes: 0
Views: 3332
Reputation: 14849
ROC AUC is only useful if you can rank-order your predictions. Using .predict()
will just give the most probable class for each sample, so you won't be able to do that rank ordering.
In the example below, I fit a random forest on a randomly generated dataset and test it on a held-out sample. The blue line shows the proper ROC curve obtained with .predict_proba(),
while the green one shows the degenerate curve obtained with .predict(),
which only really knows about a single cutoff point.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Noisy synthetic binary classification problem, split 90/10 into train and held-out test
data, target = make_classification(n_samples=4000, n_features=2, n_redundant=0, flip_y=0.4)
train, test, train_t, test_t = train_test_split(data, target, train_size=0.9)

rf = RandomForestClassifier()
rf.fit(train, train_t)

# Blue: ROC curve from the positive-class probabilities (rankable scores)
plt.plot(*roc_curve(test_t, rf.predict_proba(test)[:, 1])[:2])
# Green: degenerate "ROC curve" from the hard 0/1 predictions (a single cutoff)
plt.plot(*roc_curve(test_t, rf.predict(test))[:2])
plt.show()
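To make the rank-ordering point concrete, here's what the two kinds of output actually look like (continuing with the rf and test from above; the values in the comments are just illustrative):
# Hard labels: only 0s and 1s, so there is only a single possible cutoff
print(rf.predict(test)[:5])           # e.g. [0 1 1 0 1]
# Positive-class probabilities: continuous scores that can be rank-ordered
print(rf.predict_proba(test)[:5, 1])  # e.g. [0.21 0.88 0.64 0.12 0.79]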
EDIT: While there's nothing stopping you from calculating roc_auc_score()
on the output of .predict(),
the point of the above is that it's not really a useful measurement.
In [5]: roc_auc_score(test_t, rf.predict_proba(test)[:,1]), roc_auc_score(test_t, rf.predict(test))
Out[5]: (0.75502749115010925, 0.70238005573548234)
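If you really do want cross_val_score itself to score the hard .predict() labels, as in the original question, one option is to wrap roc_auc_score in make_scorer, which by default passes the estimator's .predict() output to the metric. A minimal sketch, assuming the X and y from the question (hard_label_auc is just an illustrative name):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

# By default make_scorer scores the estimator's .predict() output, not probabilities
hard_label_auc = make_scorer(roc_auc_score)

rf = RandomForestClassifier(n_estimators=1000, oob_score=True, class_weight='balanced')
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring=hard_label_auc)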
Upvotes: 1