mpg

Reputation: 3919

ROC curve error in randomForest

I am trying to create a ROC curve from the code below, but the call to prediction() fails with: Error in prediction(bc_rf_predict_prob, bc_test$Class) : Number of cross-validation runs must be equal for predictions and labels.

library(mlbench) #has the Breast Cancer dataset in it
library(caret)
data(BreastCancer) #two class model

bc_changed<-BreastCancer[2:11] #removes variables not to be used


#Create train and test/holdout samples (works fine)
set.seed(59)
bc_rand <- sample(1:699, 499) #row indices of the 499 training observations (out of 699)
bc_train <- bc_changed[ bc_rand,]
bc_test  <- bc_changed[-bc_rand,]

#random forest decision tree (works fine)
library(randomForest)
set.seed(59) 
bc_rf <- randomForest(Class ~.,data=bc_train, ntree=500,na.action = na.omit, importance=TRUE)

#ROC
library(ROCR)
actual <- bc_test$Class 
bc_rf_predict_prob<-predict(bc_rf, type="prob", bc_test) 
bc.pred = prediction(bc_rf_predict_prob,bc_test$Class) #fails with the error below

Error in prediction(bc_rf_predict_prob, bc_test$Class) : Number of cross-validation runs must be equal for predictions and labels.

I think the problem comes from this call:

bc_rf_predict_prob<-predict(bc_rf, type="prob", bc_test) 

The result is a matrix with two columns: one holding the probability of Benign for each observation and one holding the probability of Malignant. My logic tells me I should only be passing a single vector of probabilities.

Upvotes: 1

Views: 6644

Answers (1)

Myles Baker

Reputation: 3760

According to page 9 of the ROCR Library documentation, the prediction function has two required inputs, predictions and labels, which must have the same dimensions.

When predictions is a matrix or data frame, each column is treated as one cross-validation run, and all runs must have the same length.

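str(bc_rf_predict_prob) shows a matrix [1:200, 1:2], so prediction() sees two runs of predictions, while bc_test$Class is a single vector of 200 labels, i.e. one run. The counts differ, which is exactly what the error message reports.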
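The mismatch can be checked without fitting anything. The sketch below uses small made-up stand-ins (probs and labels are hypothetical names and values, not the objects from the question) and base R only:

```r
# Toy stand-ins (hypothetical values) for the objects in the question
probs <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.7, 0.3),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("benign", "malignant")))
labels <- factor(c("benign", "malignant", "benign"))

# A two-column matrix counts as two runs of predictions
n_pred_runs <- NCOL(probs)

# A plain factor or vector counts as a single run of labels
n_label_runs <- NCOL(labels)

# The two counts differ, which is the situation the error describes
runs_match <- n_pred_runs == n_label_runs
print(c(predictions = n_pred_runs, labels = n_label_runs))
```

Running the same NCOL() checks on bc_rf_predict_prob and bc_test$Class should show the same two-versus-one pattern.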

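It sounds like you want only one column of bc_rf_predict_prob, passed as a plain vector. For a ROC curve that is usually the column for the positive class (here the malignant column rather than the first, benign one), but I can't be certain without looking at the data.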
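A minimal sketch of that fix, assuming malignant is the class of interest and that the probability column is named "malignant" (the lowercase level name in mlbench's BreastCancer data). Dropping rows with missing values up front and the variable names malignant_prob, bc_pred, bc_perf and auc are choices made for this example, not part of the original code:

```r
library(mlbench)       # BreastCancer data
library(randomForest)
library(ROCR)

data(BreastCancer)
# Drop the Id column and, as a simplifying assumption, rows with missing values
bc <- na.omit(BreastCancer[2:11])

set.seed(59)
train_idx <- sample(nrow(bc), 499)
bc_train <- bc[train_idx, ]
bc_test  <- bc[-train_idx, ]

bc_rf <- randomForest(Class ~ ., data = bc_train, ntree = 500)

# predict() with type = "prob" returns one probability column per class
bc_rf_predict_prob <- predict(bc_rf, newdata = bc_test, type = "prob")

# Keep only the positive-class column so that predictions is a plain vector
malignant_prob <- bc_rf_predict_prob[, "malignant"]

# One vector of scores and one vector of labels: a single run on each side
bc_pred <- prediction(malignant_prob, bc_test$Class)

# ROC curve: true positive rate against false positive rate
bc_perf <- performance(bc_pred, measure = "tpr", x.measure = "fpr")
plot(bc_perf)

# Area under the curve as a single number
auc <- performance(bc_pred, measure = "auc")@y.values[[1]]
```

Passing the benign column instead would still run, but ROCR takes the higher factor level (malignant) as the positive class, so the curve would come out mirrored below the diagonal.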

Upvotes: 2
