XYZ

Reputation: 105

In R, predict() returns a vector of the wrong length

After calling new.y = predict(model, newx = new.x), the length of new.y differs from the number of rows of new.x.

Code is here:

install.packages('ISLR')
library(ISLR)

fix(Hitters)   # load data
Hitters = na.omit(Hitters)   # remove NA

x = model.matrix(Salary ~ ., Hitters)[ , -1]
y = Hitters$Salary

set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)   # random sampling
test = (-train)

lm.fit = lm(y ~ x, subset=train)  
lm.pred = predict( lm.fit, newx = x[test,])

dim(x[test,])   # output 132*19
length(lm.pred)   # output 131
length(y[test])   # output 132

Does anybody know why the length is wrong? Thanks!

Update: the mistake was that newx = x[test, ] is not recognized by predict(), so it was silently ignored; the argument has to be newdata. Thanks @Pascal! To make it more obvious:

install.packages('ISLR')
library(ISLR)

fix(Hitters)   # load data
Hitters = na.omit(Hitters)   # remove NA

x = model.matrix(Salary ~ ., Hitters)[ , -1]
y = Hitters$Salary

set.seed(2)
train = sample(1:nrow(x), 150)   # random sampling (specify size for testing)
test = (1:nrow(x))[-train]


lm.fit = lm(y ~ x, subset=train)
lm.pred = predict( lm.fit, newx = x[test,])
dim(x[test,])   # output 113  19
length(lm.pred)   # output 150 - still using training data


lm.fit = lm(Salary ~ ., data = Hitters, subset = train)
lm.pred = predict( lm.fit, newdata = Hitters[test,])
dim(x[test,])   # output 113  19
length(lm.pred)   # output 113
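
The same behaviour can be reproduced without the ISLR package. The following is a minimal sketch on the built-in mtcars data (an assumption for illustration, not the Hitters data; the names fit, p_wrong and p_right are made up here). It suggests that an unknown argument such as newx is absorbed by the `...` of predict.lm() and ignored, so the call falls back to the fitted values of the training rows:

```r
# Minimal sketch on the built-in mtcars data (not the Hitters data above).
set.seed(1)
n     <- nrow(mtcars)                 # 32 rows
train <- sample(seq_len(n), 20)       # 20 training rows
test  <- setdiff(seq_len(n), train)   # remaining 12 rows

fit <- lm(mpg ~ wt + hp, data = mtcars, subset = train)

# 'newx' is not an argument of predict.lm(); it is absorbed by '...'
# and ignored, so this returns the fitted values for the training rows.
p_wrong <- predict(fit, newx = mtcars[test, ])

# 'newdata' is the argument predict.lm() actually uses.
p_right <- predict(fit, newdata = mtcars[test, ])

length(p_wrong)   # 20, same as length(train)
length(p_right)   # 12, same as length(test)
```

Note that newdata only helps when the model was fitted with a formula whose variable names can be found as columns of the new data frame, which is why the second version above switches to Salary ~ . with data = Hitters.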

The two ways of defining test in the first and second code blocks select the same rows. A quick check:

x = c('A','B','C','D','E')
set.seed(2)
n = length(x)
train = sample(1:n, n/2)   # random sampling
test = -train
test   # output -1 -3
x[test]   # output "B" "D" "E"

test = (1:n)[-train]
test   # output 2 4 5
x[test]   # output "B" "D" "E"

Upvotes: 1

Views: 5708

Answers (3)

user3710546

You can simplify with:

library(ISLR)

Hitters <- na.omit(Hitters)   # remove NA

set.seed(1)
train <- sample(1:nrow(Hitters), nrow(Hitters)/2)   # random sampling
test <- (1:nrow(Hitters))[-train]  # test rows as positive indices (-train selects the same rows)

lm.fit <- lm(Salary ~ ., data = Hitters, subset = train)  
lm.pred <- predict(lm.fit, newdata = Hitters[test,])

dim(Hitters[test,])   # output 132*20
length(lm.pred)   # output 132

Upvotes: 1

Jthorpe

Reputation: 10167

Try providing a data.frame to the argument newdata. Because x is a matrix, wrap it in I() so that it stays a single column named x, as in:

lm.pred <- predict(lm.fit,
                   newdata = data.frame(x = I(x[test, ])))

Also, I'm not sure the argument subset is doing what you think it is doing. I would instead provide the argument data in your call to lm, as in:

lm.fit = lm(y ~ x,
            data = data.frame(x = I(x), y = y)[train, ])
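
One caveat worth checking with this approach: data.frame() expands a matrix argument into one column per matrix column, so the result has no column called x for the formula y ~ x to find. The sketch below uses a small made-up matrix (hypothetical data and names, not Hitters) to show the expansion and one possible way around it, keeping the matrix intact with I():

```r
# Minimal sketch with a small made-up matrix (hypothetical data).
set.seed(1)
x <- matrix(rnorm(40), ncol = 2, dimnames = list(NULL, c("a", "b")))
y <- rnorm(20)
train <- 1:12
test  <- 13:20

# data.frame() splits a matrix into one column per matrix column,
# so there is no single column called "x" afterwards.
names(data.frame(x = x, y = y))      # "x.a" "x.b" "y"

# I() keeps the matrix as one column named "x".
df <- data.frame(x = I(x), y = y)
names(df)                            # "x" "y"

fit  <- lm(y ~ x, data = df[train, ])
pred <- predict(fit, newdata = df[test, ])
length(pred)                         # 8, same as length(test)
```

Fitting with a formula such as Salary ~ . on the original data frame avoids the issue altogether, since every predictor is then an ordinary named column.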

Upvotes: 0

KarthikS

Reputation: 903

Try this:

install.packages('ISLR')
library(ISLR)

fix(Hitters)   # load data
Hitters = na.omit(Hitters)   # remove NA

x = Hitters[, names(Hitters) != "Salary"]   # predictors only (drop Salary, not the first column)
y = Hitters$Salary

set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)   # random sampling
test_data <- x[-train,]
y_test <- y[-train]
y_train<-y[train]
train_data <- data.frame(Y= y[train],x[train,]) 

lm.fit = lm(Y ~ ., train_data)  
lm.pred = predict( lm.fit, newx = test_data)

dim(test_data)   # output 161*19
length(lm.pred)   # output 130
length(y_test)   # output 161

I guess the difference in the length of lm.pred is due to the NA values in y.
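
Whether missing values explain a shorter result can be checked directly. The following is a minimal sketch on made-up data (hypothetical, not Hitters; the names d and fit are assumptions): rows with NA in the response are dropped when the model is fitted, so the fitted values are shorter than the data, while a call with newdata returns one prediction per row supplied.

```r
# Minimal sketch on made-up data (hypothetical, not Hitters).
set.seed(1)
d <- data.frame(x = rnorm(10), y = rnorm(10))
d$y[c(2, 5)] <- NA                   # two missing responses

# na.omit (R's usual default na.action) drops the rows with NA in y.
fit <- lm(y ~ x, data = d, na.action = na.omit)

nrow(d)                              # 10
length(fitted(fit))                  # 8: rows with NA in y were dropped
length(predict(fit))                 # 8: no newdata, so fitted values
length(predict(fit, newdata = d))    # 10: one prediction per row of d
```

So a length that matches the number of complete training rows, rather than the number of test rows, is a hint that predict() never received the new data.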

Upvotes: 0
