Reputation: 35
I am running into difficulties when using randomForest
(in R) for a classification problem. My R code, an image, and data are here:
http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip
On each trial the observer is presented with either a faint face image (contrast=con) buried in noise or noise alone. He rates his confidence (rating) that the face is present, and I have collapsed rating into a yes/no judgement (y). The face is either inverted (invert=1) or not within each block of 100 trials (one file). I use the contrast (first column of the predictor matrix x) and the pixels (the remaining columns) to predict y.
It is critical to my application that I end up with an "importance image" showing how much each pixel contributes to the decision y. I have 1000 trials (the length of y) and 4248 pixels + contrast = 4249 predictors (ncol of x). Using glmnet (logistic ridge regression) on this problem works fine:
fit <- cv.glmnet(x, y, family="binomial", alpha=0)
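To get the importance image from this fit I extract the ridge coefficients and reshape the pixel part, roughly as follows (the 59 x 72 layout is only a guess at the image dimensions, used here because 59 * 72 = 4248):
library(glmnet)
cf <- as.matrix(coef(fit, s = "lambda.min"))[, 1]   # intercept, contrast, then the 4248 pixel coefficients
pixel_wts <- cf[-(1:2)]                             # drop the intercept and the contrast coefficient
imp_img <- matrix(pixel_wts, nrow = 59, ncol = 72)  # hypothetical 59 x 72 pixel layout
image(imp_img)                                      # ridge-coefficient "importance image"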
However, randomForest does not work at all:
fit <- randomForest(x=x, y=y, ntree=100)
and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, while for glmnet it is 8.9%.
Please let me know what I am doing wrong with randomForest, and how to fix it.
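For reference, the kind of per-pixel importance map I am hoping to get out of randomForest would be built roughly like this (refitting with importance = TRUE; the 59 x 72 reshape is again just a guess at the image dimensions):
library(randomForest)
fit_rf <- randomForest(x = x, y = y, ntree = 100, importance = TRUE)
imp <- importance(fit_rf, type = 1)                      # mean decrease in accuracy per predictor
imp_img_rf <- matrix(imp[-1, 1], nrow = 59, ncol = 72)   # drop contrast, reshape the 4248 pixels
image(imp_img_rf)                                        # randomForest "importance image"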
Upvotes: 0
Views: 2583
Reputation: 1513
Ridge regression's only parameter, lambda, is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi, and the error rate you get out of cv.glmnet relates to that. randomForest gives you the OOB error, which is akin to the error on a dedicated test set (which is what you are interested in).
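If you want an apples-to-apples comparison, a minimal sketch (assuming the x and y from your question, with y a two-level factor) is to score both models on the same held-out trials:
library(glmnet)
library(randomForest)
set.seed(1)
test <- sample(nrow(x), round(0.2 * nrow(x)))             # hold out ~20% of the 1000 trials
fit_glm <- cv.glmnet(x[-test, ], y[-test], family = "binomial", alpha = 0)
pred_glm <- predict(fit_glm, x[test, ], s = "lambda.min", type = "class")
fit_rf <- randomForest(x = x[-test, ], y = y[-test], ntree = 100)
pred_rf <- predict(fit_rf, x[test, ])
mean(as.character(pred_glm) != as.character(y[test]))     # glmnet test error
mean(as.character(pred_rf)  != as.character(y[test]))     # randomForest test error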
randomForest requires you to calibrate it manually (i.e. use a dedicated validation set to see which parameters work best), and there are a few to consider: the depth of the trees (via fixing the number of examples in each node or the number of nodes), the number of randomly chosen attributes considered at each split, and the number of trees. You can use tuneRF to find the optimal value of mtry.
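For example, a rough tuneRF run (again assuming the x and y from your question) could look like this:
library(randomForest)
set.seed(1)
tuned <- tuneRF(x, y, mtryStart = floor(sqrt(ncol(x))), ntreeTry = 100,
                stepFactor = 1.5, improve = 0.01, doBest = FALSE)
best_mtry <- tuned[which.min(tuned[, 2]), 1]   # column 1 = mtry tried, column 2 = OOB error
fit_rf <- randomForest(x = x, y = y, ntree = 500, mtry = best_mtry)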
When evaluated on the training set, the more trees you add, the better your predictions get. However, you will see that predictive ability on a test set starts to diminish after a certain number of trees have been grown; this is due to overfitting. randomForest determines the optimal number of trees via OOB error estimates or, if you provide one, by using the test set. If rf.mod is your fitted RF model, then plot(rf.mod) will let you see roughly at which point it starts to overfit. When you use the predict function on the fitted RF, it will use the optimal number of trees.
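A minimal sketch (rf.mod is just an assumed name for a forest fitted on your data):
library(randomForest)
rf.mod <- randomForest(x = x, y = y, ntree = 500)
plot(rf.mod)              # OOB error rate as a function of the number of trees
head(rf.mod$err.rate)     # the same error rates as a matrix (OOB plus per-class columns)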
In short, you are not comparing the two models' performances correctly (as pointed out by Hong Ooi), and your parameters might also be off and/or you might be overfitting (although that is unlikely with just 100 trees).
Upvotes: 1