stan
stan

Reputation: 35

random forest gets worse as number of trees increases

I am running into difficulties when using randomForest (in R) for a classification problem. My R code, an image, and data are here: http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip The observer is presented with either a faint image (contrast=con) buried in noise or just noise on each trial. He rates his confidence (rating) that the face is present. I have categorised rating to be a yes/no judgement (y). The face is either inverted (invert=1) or not in each block of 100 trials (one file). I use the contrast (1st column of predictor matrix x) and the pixels (the rest of the columns) to predict y.

It is critical to my application that I have an "importance image" at the end which shows how much each pixel contributes to the decision y. I have 1000 trials (length of y) and 4248 pixels+contrast=4249 predictors (ncols of x). Using glmnet (logistic ridge regression) on this problem works fine

fit<-cv.glmnet(x,y,family="binomial",alpha=0)

However randomForest does not work at all,

fit <- randomForest(x=x, y=y, ntree=100)

and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, and for glmnet it is 8.9%.

Please let me know what I am doing wrong with randomForest, and how to fix it.

Upvotes: 0

Views: 2583

Answers (1)

stas g
stas g

Reputation: 1513

ridge regression's only parameter lambda is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi. and the error rate you get out of cv.glmnet realtes to that. randomForest gives you OOB error that is akin to an error on a dedicated test set (which is what you are interested in).

randomForest requires you to calibrate it manually (i.e. have a dedicated validation set to see which parameters work best) and there are a few to consider: depth of the trees (via fixing the number of examples in each node or the number of nodes), number of randomly chosen attributes considered at each split and the number of trees. you can use tuneRF to find the optimal number of mtry.

when evaluated on the train set, the more trees you add the better your predictions get. however, you will see predictive ability on a test set starts diminishing after a certain number of trees are grown -- this is due to overfitting. randomForest determines the optimal number of trees via OOB error estimates or, if you provide it, by using the test set. if rf.mod is your fitted RF model then plot(rf.mod) will allow you to see at which point roughly it starts to overfit. when using the predict function on a fitted RF it will use the optimal number of trees.

in short, you are not comparing the two models' performances correctly (as pointed out by Hong Ooi) and also your parameters might be off and/or you might be overfitting (although unlikely with just 100 trees).

Upvotes: 1

Related Questions