Senf

Reputation: 161

R - Random Forest - importance / varImpPlot

I have a problem with the importance / varImpPlot functions of randomForest and hope someone can help.

I tried two versions of the code, but I am confused by the (different) results:

1.)

rffit = randomForest(price~.,data=train,mtry=x,ntree=500)
rfvalpred = predict(rffit,newdata=test)
varImpPlot(rffit)
importance(rffit)

This shows the plot and the “importance” data, but only “IncNodePurity”. Also, the values in the plot differ from the values in the table. I tried the "scale" argument, but it did not help.

2.)

rf.analyzed_data = randomForest(price~.,data=train,mtry=x,ntree=500,importance=TRUE)
yhat.rf = predict(rf.analyzed_data,newdata=test)
varImpPlot(rf.analyzed_data)
importance(rf.analyzed_data)

In this case no plot is produced anymore. The importance data now shows both “%IncMSE” and “IncNodePurity”, but the “IncNodePurity” values differ from those of the first version.

Questions:

1.) Any idea why the “IncNodePurity” values differ?

2.) Any idea why no “%IncMSE” is shown in the first version?

3.) Why is no plot shown in the second version?

Many thanks!! Ed

Upvotes: 0

Views: 6006

Answers (1)

Soren Havelund Welling

Reputation: 1893

1) IncNodePurity is derived from the loss function, so you get that measure for free just by training the model. On the downside, it is a less stable estimate, as results may vary from one model run to the next. It is also more biased, as it favors variables with many levels. I guess the differences you found are due to randomness.

2) The variable importance (VI) measure %IncMSE takes a little extra time to compute and is therefore optional. Roughly, for every tree and every variable, the values of that variable are shuffled and the out-of-bag (OOB) samples are predicted again. As the randomForest package is designed, VI has to be computed during training, so importance must be set to TRUE. In your first version, varImpPlot cannot plot %IncMSE because it has not been computed.

3) Not sure. In this code example I see both plots at least.

library(randomForest)

#data
X = data.frame(replicate(6,rnorm(1000)))
y = with(X, X1^2 + sin(X2*pi) + X3*X4)
train = data.frame(y=y,X=X)
#training
rf1=randomForest(y~.,data=train,importance=FALSE)
rf2=randomForest(y~.,data=train,importance=TRUE)
#plotting importance
varImpPlot(rf1) #plot only with IncNodePurity


varImpPlot(rf2) #two-panel plot, also with %IncMSE

enter image description here
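To make points 1) and 2) concrete, here is a minimal sketch, assuming the same kind of simulated data as above. The seed values, ntree = 100 and the object names rf_a, rf_b and rf_c are arbitrary choices for illustration. It refits the same model with the same seed and with a different seed, and pulls out each importance measure separately through the type argument of importance().

```r
library(randomForest)

# Simulated data, same recipe as in the example above (seed is arbitrary)
set.seed(1)
X <- data.frame(replicate(6, rnorm(1000)))
y <- with(X, X1^2 + sin(X2 * pi) + X3 * X4)
train <- data.frame(y = y, X)

# Same seed and same settings: the two fits should give matching importance values
set.seed(42)
rf_a <- randomForest(y ~ ., data = train, ntree = 100, importance = TRUE)
set.seed(42)
rf_b <- randomForest(y ~ ., data = train, ntree = 100, importance = TRUE)

# Different seed: values are expected to shift somewhat from run to run
set.seed(7)
rf_c <- randomForest(y ~ ., data = train, ntree = 100, importance = TRUE)

# type = 2 selects IncNodePurity, type = 1 selects %IncMSE
purity_a <- importance(rf_a, type = 2)
purity_c <- importance(rf_c, type = 2)

# %IncMSE is divided by its standard error by default (scale = TRUE);
# scale = FALSE returns the raw mean increase in MSE instead
mse_scaled   <- importance(rf_a, type = 1)
mse_unscaled <- importance(rf_a, type = 1, scale = FALSE)

# Side-by-side view of the run-to-run variation in IncNodePurity
print(cbind(seed42 = purity_a[, 1], seed7 = purity_c[, 1]))
```

If the numbers in a table and a plot disagree, it may be worth checking that both come from the same fitted object and use the same scale setting. varImpPlot() also accepts type and scale, so varImpPlot(rf_a, type = 1) should draw only the %IncMSE panel.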

Upvotes: 2
