Lea

Reputation: 11

How do I find out the RMSE of a random forest in R?

I need to calculate the RMSE of a regression random forest.

First, I fitted the random forest with this call:

randomForest(price ~ ., type = "regression", data = train.data, ntree  = 400,
             mtry = 20)

Do I need to make predictions in a further step to get the RMSE? My plan is to predict on the test data and then call rmse(actual, predicted) from the "Metrics" package. Also, is a seed of 12 appropriate for a data set with 1000 observations and 20 variables?

Upvotes: 1

Views: 11569

Answers (2)

Len Greski

Reputation: 10855

When the data have been partitioned into training and test sets, you calculate root mean squared error (RMSE) on the test data by generating predictions with the predict() function and then computing RMSE from the predicted and actual values.

We'll use the BostonHousing data from the mlbench package to illustrate.

library(randomForest)
library(mlbench)
library(caret) # use createDataPartition() function 
set.seed(95014)
data(BostonHousing)

# stratified 60/40 split on chas (whether the tract bounds the Charles River)
inTraining <- createDataPartition(BostonHousing$chas, p = 0.6, list=FALSE)
training <- BostonHousing[inTraining,]
testing <- BostonHousing[-inTraining,]

fit <- randomForest(medv ~ ., training, ntree=30, type="regression")

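Having fitted the model, we can print it to see its mean squared error. For a regression forest, randomForest reports this as the out-of-bag (OOB) error: each training observation is predicted using only the trees whose bootstrap sample did not contain it, so the figure is not a plain training-set error.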

fit

> fit

Call:
 randomForest(formula = medv ~ ., data = training, ntree = 30,      type = "regression") 
               Type of random forest: regression
                     Number of trees: 30
No. of variables tried at each split: 4

          Mean of squared residuals: 16.90869
                    % Var explained: 81.51

To calculate RMSE for the model, we can also extract the last element of fit$mse, which is the OOB mean squared error of the forest once all of its trees have been added, and take its square root.

# obtain MSE as of last element in fit$mse
# which should match the output from printout
fit$mse[length(fit$mse)]
# take square root to calculate RMSE for the model
sqrt(fit$mse[length(fit$mse)])


> fit$mse[length(fit$mse)]
[1] 16.90869
> sqrt(fit$mse[length(fit$mse)])
[1] 4.112018
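Because fit$mse holds one value per tree count, the same idea gives the OOB RMSE for every forest size, not just the final one. The following is a minimal sketch rather than part of the original answer. It assumes the randomForest package is installed and uses the built-in mtcars data purely for illustration; the object names fit_cars and oob_rmse are made up here.

```r
library(randomForest)

set.seed(95014)
# mtcars ships with R; mpg is the numeric response, so this is a regression forest
fit_cars <- randomForest(mpg ~ ., data = mtcars, ntree = 50)

# fit_cars$mse[k] is the OOB mean squared error of the forest built from the first k trees
oob_rmse <- sqrt(fit_cars$mse)

length(oob_rmse)    # one value per tree count, 50 here
tail(oob_rmse, 1)   # OOB RMSE of the full forest

# the same number, recomputed from the OOB predictions stored in the fit
sqrt(mean((mtcars$mpg - fit_cars$predicted)^2, na.rm = TRUE))
```

Plotting oob_rmse against the number of trees is a quick way to judge whether more trees would still reduce the error.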

To calculate RMSE for the test data, we need to first generate predicted values.

# now illustrate how to calculate RMSE on test data vs. training data
predValues <- predict(fit,testing)

RMSE is simply the square root of the average of the squared errors.

# we can calculate it  directly 
sqrt(mean((testing$medv - predValues)^2))

> sqrt(mean((testing$medv - predValues)^2))
[1] 2.944943
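The direct calculation can be wrapped in a small helper so that it can be reused and checked by hand. This is only a sketch: rmse_manual is a hypothetical name, not a function from any package, and the three toy values are made up so that the arithmetic is easy to follow.

```r
# hypothetical helper: square root of the mean squared difference
rmse_manual <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# toy values chosen for easy arithmetic
actual    <- c(3, 5, 2)
predicted <- c(2, 5, 4)

# errors are 1, 0 and -2; squared errors are 1, 0 and 4; their mean is 5/3
rmse_manual(actual, predicted)  # sqrt(5/3), about 1.29
```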

Alternatively, we can load the Metrics package and use its rmse() function. Notice that it produces the same result that we calculated in base R.

# compare to Metrics::rmse() function
library(Metrics)
rmse(testing$medv,predValues)

> rmse(testing$medv,predValues)
[1] 2.944943

Regarding the question about the seed, the set.seed() function fixes the starting point of the random number generator so that the results of an analysis are reproducible. It does not affect the 'quality' of the analysis, so a seed of 12 is no better or worse than any other value.

Because set.seed(95014) is called before any R function that uses the random number generator, anyone who runs the code from this answer with the same versions of R and the packages should get exactly the same rmse() results as those posted here.

caret::createDataPartition() uses the random number generator to draw a stratified sample, here within each level of chas (adjacency to the Charles River). Setting a seed before this step ensures that everyone who runs the code in this answer gets the same observations in the training and testing data frames as I did.
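The effect of the seed can be checked with base R alone. The snippet below is a small illustrative sketch (the variable names are made up): drawing after the same seed repeats the sample exactly, while a different seed simply gives a different random sample.

```r
# same seed, same draws: the seed only fixes where the random number stream starts
set.seed(12)
first_draw <- sample(1000, 5)

set.seed(12)
second_draw <- sample(1000, 5)

identical(first_draw, second_draw)  # TRUE

# a different seed gives a different, equally valid, random sample
set.seed(95014)
other_draw <- sample(1000, 5)
identical(first_draw, other_draw)
```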

Upvotes: 4

StupidWolf

Reputation: 46908

Yes, you need to generate predictions on your test data. I don't know at which point you set your seed, so in the example below I set it in two places. The first call comes before splitting the data into train and test, so that the train/test split can be reproduced. The second comes before each randomForest() call (inside the lapply()), so that the fitted forest itself can be reproduced.

For example:

library(randomForest)
library(MASS)
library(Metrics)  # provides rmse()

data = Boston
set.seed(999)  # makes the train/test split reproducible
trn = sample(nrow(data), 400)
traindata = data[trn, ]
testdata = data[-trn, ]

# fit one forest per seed and record the test-set RMSE
res = lapply(c(111, 222), function(i) {
  set.seed(i)  # makes this forest reproducible
  fit = randomForest(medv ~ ., data = traindata)

  pred_values = predict(fit, testdata)
  actual_values = testdata$medv

  data.frame(seed = i,
             metrics_rmse = rmse(actual_values, pred_values),
             cal_rmse = mean((pred_values - actual_values)^2)^0.5)
})

res = do.call(rbind,res)
head(res)

  seed metrics_rmse cal_rmse
1  111     4.700245 4.700245
2  222     4.742978 4.742978

Upvotes: 3
