LUSAQX

Reputation: 423

Cross Validation with ROC?

I use the following code to run cross-validation, returning ROC AUC scores.

rf = RandomForestClassifier(n_estimators=1000, oob_score=True, class_weight='balanced')
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring='roc_auc')

How can I return the ROC AUC based on

roc_auc_score(y_test, results.predict(X_test))

rather than

roc_auc_score(y_test, results.predict_proba(X_test))

Upvotes: 0

Views: 3332

Answers (1)

Randy

Reputation: 14849

ROC AUC is only useful if you can rank-order your predictions. Using .predict() just gives the most probable class for each sample, so there is no ranking to build a curve from.
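For intuition, here's a minimal sketch of the difference between the two outputs (the toy dataset and model settings are just placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data and model, purely for illustration
X, y = make_classification(n_samples=100, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Hard 0/1 labels: every positive prediction ties at the same "score"
print(clf.predict(X[:5]))
# Per-sample probabilities: these can be rank-ordered for an ROC curve
print(clf.predict_proba(X[:5])[:, 1])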

In the example below, I fit a random forest on a randomly generated dataset and tested it on a held-out sample. The blue line shows the proper ROC curve computed from .predict_proba(), while the green shows the degenerate one from .predict(), which only knows about a single cutoff point.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rf = RandomForestClassifier()

# Noisy synthetic binary classification problem
data, target = make_classification(n_samples=4000, n_features=2, n_redundant=0, flip_y=0.4)
train, test, train_t, test_t = train_test_split(data, target, train_size=0.9)

rf.fit(train, train_t)

# Proper ROC curve from ranked probabilities (blue)
plt.plot(*roc_curve(test_t, rf.predict_proba(test)[:, 1])[:2])
# Degenerate "curve" from hard class labels (green)
plt.plot(*roc_curve(test_t, rf.predict(test))[:2])
plt.show()

[Figure: ROC curves, blue from .predict_proba(), green from .predict()]

EDIT: While there's nothing stopping you from calculating roc_auc_score() on .predict() output, the point of the above is that it's not really a useful measurement.

In [5]: roc_auc_score(test_t, rf.predict_proba(test)[:,1]), roc_auc_score(test_t, rf.predict(test))
Out[5]: (0.75502749115010925, 0.70238005573548234) 
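If you still want cross_val_score to score on hard labels, one way is to wrap the metric with make_scorer(), which by default passes .predict() output to the metric rather than probabilities. A sketch, reusing rf, X, and y from the question:

import numpy as np
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

# make_scorer defaults to feeding .predict() output to the metric,
# so this computes the degenerate AUC discussed above
hard_label_auc = make_scorer(roc_auc_score)
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring=hard_label_auc)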

Upvotes: 1
