elysefaulkner

Reputation: 1053

How can I calculate Residual Standard Error in R for Test Data set?

I have split the Boston dataset into training and test sets as below:

  library(MASS)
  smp_size <- floor(.7 * nrow(Boston))
  set.seed(133)
  train_boston <- sample(seq_len(nrow(Boston)), size = smp_size)
  train_ind <- sample(seq_len(nrow(Boston)), size = smp_size)
  train_boston <- Boston[train_ind, ]
  test_boston <- Boston[-train_ind,]
  nrow(train_boston)
  # [1] 354
  nrow(test_boston)
  # [1] 152

Now I get the RSE using lm function as below:

  train_boston.lm <- lm(lstat~medv, train_boston)
  summary(train_boston.lm)
  summary(train_boston.lm)$sigma

How can I calculate Residual Standard error for the test data set? I can't use lm function on the test data set. Is there any method to calculate RSE on test data set?

Upvotes: 0

Views: 11226

Answers (1)

MrFlick

Reputation: 206606

Here the residual standard error reported by summary() is the same as the manual calculation that follows it:

summary(train_boston.lm)$sigma
# [1] 4.73988

sqrt(sum((fitted(train_boston.lm)-train_boston$lstat)^2)/
    (nrow(train_boston)-2))
# [1] 4.73988

You are estimating two parameters (the intercept and the slope), so you lose two degrees of freedom and the divisor is n-2.

With your test data you are not estimating anything, so this is not really the same quantity. But if you want the same type of calculation, using the model's predictions for the new data in place of the fitted values from the original model, you can do

sqrt(sum((predict(train_boston.lm, test_boston)-test_boston$lstat)^2)/
    (nrow(test_boston)-2))

Although it may make more sense just to calculate the standard deviation of the prediction residuals (note that sd() centers the residuals on their mean and divides by n-1)

sd(predict(train_boston.lm, test_boston)-test_boston$lstat)
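If you need these numbers more than once, the calculations above can be wrapped in a small function. This is only a sketch: `prediction_error` and its `df_used` argument are names made up for illustration, not part of base R or MASS, and the split is re-created here so the block runs on its own (the rows selected will not necessarily match the ones in the question).

```r
library(MASS)  # provides the Boston data set

# Re-create a 70/30 split; the seed is illustrative only.
set.seed(133)
train_ind <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
train_boston <- Boston[train_ind, ]
test_boston  <- Boston[-train_ind, ]

fit <- lm(lstat ~ medv, data = train_boston)

# Hypothetical helper: square root of the summed squared prediction errors
# divided by (n - df_used). With df_used = 0 this is the plain RMSE; with
# df_used = 2 it mirrors the residual standard error formula for a
# one-predictor model.
prediction_error <- function(model, newdata, response, df_used = 0) {
  res <- newdata[[response]] - predict(model, newdata = newdata)
  sqrt(sum(res^2) / (length(res) - df_used))
}

# On the training data with df_used = 2 this reproduces summary(fit)$sigma.
train_rse <- prediction_error(fit, train_boston, "lstat", df_used = 2)

# On the test data no parameters are estimated, so the plain RMSE is the
# more usual summary; the n - 2 version is shown only for comparison.
test_rmse      <- prediction_error(fit, test_boston, "lstat")
test_rse_style <- prediction_error(fit, test_boston, "lstat", df_used = 2)
```

Comparing `train_rse` with `test_rmse` gives a rough idea of how much worse the model does on data it has not seen.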

Upvotes: 3
