leeneumann

Reputation: 27

Getting a warning using predict function in R

I have a data set of 400 observations, which I divided into two separate sets: one for training (300 observations) and one for testing (100 observations). I am trying to fit a step function regression. The problem is that when I use the model to predict values from the test set, I get a warning:

Warning message: 'newdata' had 100 rows but variables found have 300 rows

The variable I am trying to predict is Income and the explanatory variable is called Age.

This is the code:

fit=lm(Income ~ cut(training$Age, 4), data=training)
predict(fit,test)

Instead of getting 100 predictions based on the test data, I get a warning and 300 predictions based on the training data.

I have read about other people having this problem, and the usual answer is that the variable name differs between the data set and the model. I don't think that is the problem here, because I don't get a warning when I use an ordinary simple regression:

lm.fit=lm(Income~Age,data = training)
predict(lm.fit,test)

Upvotes: 0

Views: 1769

Answers (1)

G5W

Reputation: 37661

There are a number of problems here, so it will take several steps to get to a good answer. You did not provide data, so I am going to use other data that produces the same kind of warning. The built-in data set iris has 4 continuous variables. I will arbitrarily select two for use here, then apply code just like yours:

MyData = iris[,3:4]
set.seed(2017)         # for reproducibility
T = sample(150, 100)
training = MyData[ T, ]
test     = MyData[-T, ]

fit=lm(Petal.Width ~ cut(training$Petal.Length, 4), data=training)
predict(fit,test)
Warning message:
'newdata' had 50 rows but variables found have 100 rows 

So I am getting the same type of warning.
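To make the warning concrete, here is a minimal sketch that captures the message and counts the predictions. It assumes the same iris split as above; the names idx, msg and pred are only illustrative (idx stands in for T, to avoid masking R's shorthand for TRUE).

```r
# Same split as above; idx is an illustrative name for the sampled row indices
MyData <- iris[, 3:4]
set.seed(2017)
idx <- sample(150, 100)
training <- MyData[idx, ]
test <- MyData[-idx, ]

# The formula names training$Petal.Length explicitly
fit <- lm(Petal.Width ~ cut(training$Petal.Length, 4), data = training)

# Capture the warning text instead of letting it print
msg <- NULL
pred <- withCallingHandlers(
  predict(fit, test),
  warning = function(w) {
    msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

msg            # the 'newdata' warning text
length(pred)   # 100: one prediction per training row
nrow(test)     # 50: the rows that were actually passed in
```

The prediction count matches the training set, not the test set, so the values returned are not predictions for test at all.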

cut is changing the continuous variable Petal.Length into a factor with 4 levels. Because the formula refers to training$Petal.Length explicitly, predict keeps looking up that 100-row vector rather than the column in test, which is what the warning reports. Even with the bare column name in the formula, you would be building the model on the factor while test still has the continuous values (Age in your data; Petal.Length in mine), so evaluating the predict statement would mean evaluating cut(test$Petal.Length, 4) as part of the process. Look at what that means.

C1 = cut(training$Petal.Length, 4)
C2 = cut(test$Petal.Length, 4)
levels(C1)
[1] "(0.994,2.42]" "(2.42,3.85]"  "(3.85,5.28]"  "(5.28,6.71]" 
levels(C2)
[1] "(1.09,2.55]" "(2.55,4]"    "(4,5.45]"    "(5.45,6.91]"

The levels are completely different. There is no way that your model can be used on these different levels. You can see the bin boundaries for C1 so it is tempting to just use those boundaries and partition the test data.

levels(C1)
[1] "(0.994,2.42]" "(2.42,3.85]"  "(3.85,5.28]"  "(5.28,6.71]"
CutPoints = c(0.994, 2.42, 3.85, 5.28, 6.71) 
C2 = cut(test$Petal.Length, breaks=CutPoints, include.lowest=TRUE)

But on careful examination, you will see that this did not work. Printing out just a relevant piece of the data:

C2[42:46]
[1] (5.28,6.71] (5.28,6.71]  <NA> (3.85,5.28] (3.85,5.28]

C2[44] is undefined. Why? One of the values in the test set fell outside the range of values for the training set, so it does not belong in any bin.

test$Petal.Length[44]
[1] 6.9

So what you really need to do is impose no lower limit or upper limit.

## cut the training data to get cut points
C1 = cut(training$Petal.Length, 4)
levels(C1)
[1] "(0.994,2.42]" "(2.42,3.85]"  "(3.85,5.28]"  "(5.28,6.71]"
CutPoints = c(-Inf, 2.42, 3.85, 5.28, Inf)
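As a quick check of why the open-ended cut points matter, the sketch below bins two out-of-range values both ways. The values 0.5 and 6.9 and the names finite_breaks, open_breaks, in_finite and in_open are only illustrative.

```r
# Cut points taken from the training bins, with and without open ends
finite_breaks <- c(0.994, 2.42, 3.85, 5.28, 6.71)
open_breaks   <- c(-Inf, 2.42, 3.85, 5.28, Inf)

# Illustrative values: one below and one above the training range
x <- c(0.5, 6.9)

in_finite <- cut(x, breaks = finite_breaks, include.lowest = TRUE)
in_open   <- cut(x, breaks = open_breaks)

is.na(in_finite)     # both TRUE: neither value falls in any bin
as.integer(in_open)  # bin indices 1 and 4: first and last bin
```

With -Inf and Inf as the outer cut points, every finite value lands in some bin, so new data cannot produce NA bins just by being more extreme than the training data.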

It may be easiest to just make new data.frames with the binned data

Binned.training = training
Binned.training$Petal.Length = cut(training$Petal.Length, CutPoints)
Binned.test = test
Binned.test$Petal.Length = cut(test$Petal.Length, CutPoints)

fit=lm(Petal.Width ~ Petal.Length, data=Binned.training)
predict(fit,Binned.test)
## No errors

This will work for your test data and any data that you get in the future.
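If you would rather not type the cut points by hand, one possible variation is to compute them from the training data and put the cut call, with fixed breaks and the bare column name, directly in the formula. This is a sketch under assumptions: the names idx, n_bins, inner and brks are illustrative, and the interior breaks are computed as equally spaced points between the training minimum and maximum, which I believe is where cut places them when given a bin count (the labels printed by levels() are rounded versions of these).

```r
# Same split as above; idx is an illustrative name for the sampled row indices
MyData <- iris[, 3:4]
set.seed(2017)
idx <- sample(150, 100)
training <- MyData[idx, ]
test <- MyData[-idx, ]

# Interior breaks from the training range only, outer bins left open-ended
n_bins <- 4
rng <- range(training$Petal.Length)
inner <- seq(rng[1], rng[2], length.out = n_bins + 1)[2:n_bins]
brks <- c(-Inf, inner, Inf)

# Bare column name in the formula, so predict() bins the test column itself
fit <- lm(Petal.Width ~ cut(Petal.Length, breaks = brks), data = training)
pred <- predict(fit, test)

length(pred)  # one prediction per test row
anyNA(pred)   # FALSE: every test value falls in some bin
```

Because the breaks are fixed once from the training data, the same bins are applied to any data frame passed to predict, and no separate binned copies of the data are needed.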

Upvotes: 3
