Reputation: 407
I am having a hard time understanding how to build a ROC curve, and I have now come to the conclusion that maybe I am not creating the model correctly. I am fitting a randomForest model on a dataset where the class attribute "y_n" is 0 or 1. I have split the data into bank_training and bank_testing for prediction purposes. Here are the steps I follow:
bankrf <- randomForest(y_n~., data=bank_training, mtry=4, ntree=2,
keep.forest=TRUE, importance=TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type='response',
predict.all=TRUE, norm.votes=TRUE)
Is what I have done so far correct? The bankrf.pred object that is created is a list with 2 components named aggregate and individual. I don't understand where these 2 names came from. Moreover, when I run:
summary(bankrf.pred)
           Length Class  Mode
aggregate  22606  factor numeric
individual 45212  -none- character
What does this summary mean? The datasets (training & testing) have 22605 and 22606 rows respectively. If someone could explain to me what is happening, I would be very grateful. I think there is something wrong in all this.
When I try to draw the ROC curve with ROCR, I use the following code:
library(ROCR)
pred <- prediction(bank_testing$y_n, bankrf.pred$c(0,1))
Error in is.data.frame(labels) : attempt to apply non-function
Is this just a mistake in the way I try to create the ROC curve, or does the problem start earlier, with randomForest?
Upvotes: 0
Views: 3265
Reputation: 36
You should remove the predict.all=TRUE argument from predict if you simply want to get the predicted classes. By using predict.all=TRUE, you are telling the function to return the predictions of every individual tree in addition to the aggregate prediction from the forest.
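To make the difference concrete, here is a minimal sketch that compares the two calls. The simulated data frame and the names toy, train and test are hypothetical stand-ins for bank_training and bank_testing, and ntree = 25 is an arbitrary choice for illustration.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in data; the real bank data is not available here.
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
y_n <- factor(as.integer(x1 + x2 + rnorm(n) > 0))  # factor with levels "0" and "1"
toy <- data.frame(x1, x2, y_n)

train <- toy[1:100, ]
test  <- toy[101:200, ]

rf <- randomForest(y_n ~ ., data = train, ntree = 25)

# Without predict.all: a factor with one predicted class per test row.
pred_forest <- predict(rf, test, type = "response")

# With predict.all = TRUE: a list holding the forest-level prediction
# ("aggregate") and a matrix of per-tree predictions ("individual").
pred_all <- predict(rf, test, type = "response", predict.all = TRUE)

str(pred_forest)
names(pred_all)
dim(pred_all$individual)  # rows = test observations, columns = trees
```

This would also explain the lengths in the summary from the question: with ntree=2, the individual component holds 22606 x 2 = 45212 entries.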
Upvotes: 0
Reputation: 173527
The documentation for the function you are attempting to use includes this description of its two main arguments:
predictions A vector, matrix, list, or data frame containing the predictions.
labels A vector, matrix, list, or data frame containing the true class labels. Must have the same dimensions as 'predictions'.
You are currently passing the variable y_n to the predictions argument, and what looks to me like nonsense to the labels argument.
The predictions will be stored in the output of predict for the random forest model. As documented at ?predict.randomForest, when predict.all=TRUE it will be a list with two components: aggregate will contain the predicted values for the entire forest, while individual will contain the predicted values for each individual tree.
So you probably want to do something like this:
prediction(bankrf.pred$aggregate, bank_testing$y_n)
See how that works? The predicted values are passed to the predictions argument, while the "labels", or true values, are passed to the labels argument.
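As an end-to-end illustration of that argument order, here is a hedged sketch that goes from model fit to ROC curve. It makes one choice that is not part of the answer above: it scores the test set with type = "prob" and passes the probability of class "1" to prediction, on the assumption that a ROC curve is more informative with continuous scores than with hard class labels. The simulated data and the names toy, train and test are hypothetical stand-ins for the bank data.

```r
library(randomForest)
library(ROCR)

set.seed(1)

# Hypothetical stand-in data; replace with bank_training / bank_testing.
n   <- 400
x1  <- rnorm(n)
x2  <- rnorm(n)
y_n <- factor(as.integer(x1 - x2 + rnorm(n) > 0))  # levels "0" and "1"
toy <- data.frame(x1, x2, y_n)

train <- toy[1:200, ]
test  <- toy[201:400, ]

rf <- randomForest(y_n ~ ., data = train, ntree = 200)

# Class probabilities: one column per factor level; keep the column for "1".
prob_1 <- predict(rf, test, type = "prob")[, "1"]

# First argument: predicted scores. Second argument: true labels.
pred <- prediction(prob_1, test$y_n)

# True positive rate against false positive rate gives the ROC curve.
roc <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc)
abline(0, 1, lty = 2)  # reference line for a random classifier

# Area under the curve as a single summary number.
auc <- performance(pred, measure = "auc")@y.values[[1]]
auc
```

With only two trees, as in the question's ntree=2, the predicted probabilities can take very few distinct values, so the curve would have only a handful of points; a larger forest gives a smoother curve.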
Upvotes: 2