How can I identify which observations are used in a linear regression?

Question

I am having trouble running predict after fitting a linear regression because I cannot figure out which observations are actually included in the fit.

Let's say I run the model:

model1 <- lm(outcome ~ employee + shape + size + color + I(color^2),
  data = data)

The number of observations identified in the regression output is 224605.
When I try to run predict like so:

test = data.frame(y = predict(model1), x = data$employee)

Error in data.frame(y = predict(model1), x = data$employee) : 
  arguments imply differing number of rows: 224605, 233262

I thought I could get the correct number of observations like so:

> test = na.omit(data, cols = all.vars(model1))
> nrow(test)
[1] 207256

but this still does not yield the correct number of observations. Is there a direct way to grab the observations actually being used by linear regression?

Cameron Bieganek · Accepted Answer

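Try model.frame, which returns the data frame that was actually used to fit the model, with rows containing missing values already dropped: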

set.seed(1)

df <- data.frame(x = rnorm(10), y = rnorm(10))
df[c(3, 5), 1] <- NA
df[7, 2] <- NA

df
#             x           y
# 1  -0.6264538  1.51178117
# 2   0.1836433  0.38984324
# 3          NA -0.62124058
# 4   1.5952808 -2.21469989
# 5          NA  1.12493092
# 6  -0.8204684 -0.04493361
# 7   0.4874291          NA
# 8   0.7383247  0.94383621
# 9   0.5757814  0.82122120
# 10 -0.3053884  0.59390132

fit <- lm(y ~ x, df)

model.frame(fit)
#              y          x
# 1   1.51178117 -0.6264538
# 2   0.38984324  0.1836433
# 4  -2.21469989  1.5952808
# 6  -0.04493361 -0.8204684
# 8   0.94383621  0.7383247
# 9   0.82122120  0.5757814
# 10  0.59390132 -0.3053884
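
To connect this back to the predict error in the question, here is a minimal sketch that reuses the toy df above. It assumes the original data frame still has its default row names, so the row names of the model frame can be used to index it. The names used_rows, dropped_rows, aligned, fit_padded and padded are only illustrative.

```r
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10))
df[c(3, 5), 1] <- NA
df[7, 2] <- NA

fit <- lm(y ~ x, df)

# Row names of the observations that lm() kept
used_rows <- rownames(model.frame(fit))

# Rows dropped because of missing values are recorded by na.action()
dropped_rows <- names(na.action(fit))

# Pair the fitted values with the matching rows of the original data
aligned <- data.frame(y_hat = predict(fit), x = df[used_rows, "x"])

# Alternative: with na.exclude, predict() pads the dropped rows with NA,
# so the result has the same number of rows as df
fit_padded <- lm(y ~ x, df, na.action = na.exclude)
padded <- data.frame(y_hat = predict(fit_padded), x = df$x)
```

Which of the two is more convenient probably depends on whether you want a result restricted to the fitted rows (aligned) or one that lines up row by row with the original data (padded).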
