Difference in linear regression codes

Question

I am teaching myself R from "An Introduction to Statistical Learning: With Applications in R". I expected both code chunks below to give the same test mean squared error (MSE), but I get a drastically different result. Can someone please help me find out why I am not getting the same MSE? It looks like the first code chunk is the one that is wrong. Both use the Auto data set. My predictions and the book's predictions are different, even though both models were trained on the same index.

First Chunk (my code)

set.seed(1)
train_index = sample (392, 196)
Auto$index = c(1:nrow(Auto))
train_df = Auto[train_index,]
test_df = anti_join(Auto, train_df, by="index")
attach(train_df)
lm.fit = lm(mpg ~ horsepower)
predictions = predict(lm.fit, horsepower = test_df$horsepower)

mean((test_df$mpg - predictions)^2)

Second Chunk (book's code - An Introduction to Statistical Learning: With Applications in R)

set.seed(1)
train = sample (392, 196)
lm.fit = lm(mpg ~ horsepower , data = Auto , subset = train)
attach(Auto)

mean (( mpg - predict(lm.fit , Auto))[-train ]^2)

zephryl · Accepted Answer

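In your code, you’re not specifying the test data correctly in predict(). predict() takes a data frame containing the predictor variables, passed to the newdata argument; instead, you include horsepower = test_df$horsepower, which just gets absorbed by ... and has no effect. With no newdata supplied, predict() returns the fitted values for the training rows, so your MSE compares the test set's mpg against predictions for the training set.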

If you instead pass the whole test_df dataframe to newdata, you get the same result as the text.

library(ISLR)
library(dplyr)
set.seed(1)

# OP’s code with change to predict()
train_index = sample(392, 196)
Auto$index = c(1:nrow(Auto))
train_df = Auto[train_index,]
test_df = anti_join(Auto, train_df, by="index")
attach(train_df)
lm.fit = lm(mpg ~ horsepower)
predictions = predict(lm.fit, newdata = test_df)
mean((test_df$mpg - predictions)^2)
# 23.26601

# ISLR code
set.seed (1)
train = sample (392 , 196)
lm.fit = lm(mpg ~ horsepower , data = Auto , subset = train)
attach(Auto)
mean (( mpg - predict(lm.fit , Auto))[-train ]^2)
# 23.26601
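
To see the silent fallback in isolation, here is a minimal sketch on the built-in mtcars data rather than Auto. The 20/12 row split and the names train_rows, wrong and right are my own choices for illustration, not anything from the book. It shows that a misnamed argument is ignored and predict() hands back the training fitted values.

```r
# Illustration on the built-in mtcars data (not the Auto data from the question)
train_rows <- 1:20                        # arbitrary split: 20 training rows, 12 test rows
test_df <- mtcars[-train_rows, ]
fit <- lm(mpg ~ hp, data = mtcars, subset = train_rows)

# Misnamed argument: hp = ... is swallowed by `...` and ignored,
# so predict() falls back to the fitted values for the training rows
wrong <- predict(fit, hp = test_df$hp)

# Correct: pass a data frame that has a column named hp via newdata
right <- predict(fit, newdata = test_df)

length(wrong)   # one value per training row
length(right)   # one value per test row
all.equal(unname(wrong), unname(fitted(fit)))   # wrong is just the fitted values
```

A quick length check like this is a cheap guard: if the prediction vector is not the same length as the test set, newdata was not picked up. In the question the two halves happen to be the same size (196 each), which is why R raised no warning about mismatched lengths.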
