Reputation: 11
I need to find the RMSE of a random forest regression model.
First, I fit the random forest with this call:
randomForest(price ~ ., type = "regression", data = train.data, ntree = 400,
mtry = 20)
Do I need to make predictions in a further step to find the RMSE? My plan is to predict on the test data and then call rmse(actual, predicted) from the “Metrics” package. Also, is a seed of 12 appropriate for a data set with 1000 observations and 20 variables?
Upvotes: 1
Views: 11569
Reputation: 10855
In the scenario where one has partitioned data into training and test groups, to calculate root mean squared error (RMSE) on the test data, one uses the predict() function and then calculates RMSE.
We'll use the BostonHousing data from the mlbench package to illustrate.
library(randomForest)
library(mlbench)
library(caret) # use createDataPartition() function
set.seed(95014)
data(BostonHousing)
# partition based on whether house is adjacent to Charles River
inTraining <- createDataPartition(BostonHousing$chas, p = 0.6, list=FALSE)
training <- BostonHousing[inTraining,]
testing <- BostonHousing[-inTraining,]
fit <- randomForest(medv ~ ., training, ntree=30, type="regression")
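Having fit the model, we can see its mean squared error by printing the model object. Note that randomForest computes this value from out-of-bag (OOB) predictions on the training data, so it is not a plain in-sample training error.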
fit
> fit
Call:
randomForest(formula = medv ~ ., data = training, ntree = 30, type = "regression")
Type of random forest: regression
Number of trees: 30
No. of variables tried at each split: 4
Mean of squared residuals: 16.90869
% Var explained: 81.51
To calculate RMSE, we can also extract the last element of fit$mse, which is the out-of-bag MSE of the forest once the final tree has been added, and take its square root.
# obtain MSE as of last element in fit$mse
# which should match the output from printout
fit$mse[length(fit$mse)]
# take square root to calculate RMSE for the model
sqrt(fit$mse[length(fit$mse)])
> fit$mse[length(fit$mse)]
[1] 16.90869
> sqrt(fit$mse[length(fit$mse)])
[1] 4.112018
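Because fit$mse holds one value per forest size, its square root traces how the out-of-bag RMSE changes as trees are added. The following is a minimal sketch, assuming the randomForest package is installed. It uses the built-in mtcars data purely for illustration; the data set, the seed value, and the names fit_demo and oob_rmse are assumptions and not part of this answer.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the run is repeatable
# mtcars ships with R; it stands in for the Boston housing data here
fit_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 30)

# one out-of-bag MSE per forest size (1 tree, 2 trees, ..., 30 trees)
oob_rmse <- sqrt(fit_demo$mse)

length(oob_rmse)   # one entry per tree
tail(oob_rmse, 1)  # same value as sqrt(fit_demo$mse[length(fit_demo$mse)])
```

Looking at the whole vector rather than only the last element can help judge whether adding more trees still lowers the error.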
To calculate RMSE for the test data, we need to first generate predicted values.
# now illustrate how to calculate RMSE on test data vs. training data
predValues <- predict(fit,testing)
RMSE is simply the square root of the average of the squared errors.
# we can calculate it directly
sqrt(mean((testing$medv - predValues)^2))
> sqrt(mean((testing$medv - predValues)^2))
[1] 2.944943
Alternatively, we can load the Metrics package and use its rmse() function. Notice that it produces the same result that we calculated with base R.
# compare to Metrics::rmse() function
library(Metrics)
rmse(testing$medv,predValues)
> rmse(testing$medv,predValues)
[1] 2.944943
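Both calculations above reduce to the same three steps: take the differences, square and average them, then take the square root. As a small worked sketch in base R (rmse_manual is a hypothetical helper, and the numbers are made up for illustration rather than taken from the Boston data):

```r
# Hypothetical helper, not part of any package: RMSE of two numeric vectors
rmse_manual <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(3, 5, 8)    # made-up observed values
predicted <- c(2, 5, 10)   # made-up predictions

errors <- actual - predicted       # 1, 0, -2
mse    <- mean(errors^2)           # (1 + 0 + 4) / 3
rmse_manual(actual, predicted)     # equals sqrt(mse), i.e. sqrt(5 / 3)

# The errors are squared, so swapping the arguments gives the same value
rmse_manual(predicted, actual)
```

Since the result does not depend on argument order, rmse(actual, predicted) and rmse(predicted, actual) agree.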
Regarding the question about the seed, the set.seed() function fixes the starting state of the random number generator to make the results of an analysis reproducible. It does not affect the 'quality' of the analysis, so 12 is as appropriate as any other value, regardless of the size of the data.
Because set.seed(95014) is called before any R functions that use the random number generator, anyone who runs the code from this answer will obtain exactly the same rmse() results as those posted here.
caret::createDataPartition() uses the random number generator to partition the houses, stratified by their adjacency to the Charles River. Setting a seed before this step ensures that everyone who runs the code in this answer obtains the same observations in the training and testing data frames as I did.
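The effect of set.seed() can be checked directly in base R. In this minimal sketch, the seed values 12 and 13 and the variable names are arbitrary choices for illustration:

```r
# Same seed, same call: the random draws are identical
set.seed(12)
first_draw <- sample(1000, 5)

set.seed(12)
second_draw <- sample(1000, 5)

identical(first_draw, second_draw)  # TRUE

# A different seed starts the generator somewhere else,
# so the draws will generally differ
set.seed(13)
other_draw <- sample(1000, 5)
identical(first_draw, other_draw)
```

The particular number passed to set.seed() carries no meaning of its own; what matters is that the same number is reused when the analysis is rerun.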
Upvotes: 4
Reputation: 46908
Yes, you need to generate predictions on your test data. I don't know at which point you set your seed, so in the example below I set it in two places. The first is before splitting the data into train and test sets, so that the split can be reproduced. The second is before running randomForest (inside the lapply), so that the results of the random forest itself can be reproduced.
For example:
library(randomForest)
library(MASS)
library(Metrics) # provides rmse()
data = Boston
set.seed(999)
trn = sample(nrow(data),400)
traindata = data[trn,]
testdata = data[-trn,]
res = lapply(c(111,222),function(i){
set.seed(i)
fit = randomForest(medv ~.,data=traindata)
pred_values = predict(fit,testdata)
actual_values = testdata$medv
data.frame(seed=i,
metrics_rmse = rmse(pred_values,actual_values),
cal_rmse = mean((pred_values-actual_values)^2)^0.5
)
})
res = do.call(rbind,res)
head(res)
seed metrics_rmse cal_rmse
1 111 4.700245 4.700245
2 222 4.742978 4.742978
Upvotes: 3