Parseltongue

Reputation: 11657

How can I identify which observations are used in a linear regression?

I am having trouble running predict after fitting a linear regression, because I cannot figure out which observations (rows) are actually included in the fit.

Let's say I run the model:

model1 <- lm(outcome ~ employee + shape + size + color + I(color^2),
  data = data)

The number of observations identified in the regression output is 224605.
When I try to run predict like so:

test = data.frame(y = predict(model1), x = data$employee)

Error in data.frame(y = predict(model1), x = data$employee) : 
  arguments imply differing number of rows: 224605, 233262

I thought I could get the correct number of observations like so:

> test = na.omit(data, cols = all.vars(model1))
> nrow(test)
[1] 207256

but this still does not yield the correct number of observations. Is there a direct way to grab the observations actually being used by linear regression?

Upvotes: 3

Views: 2696

Answers (2)

Gregor Thomas

Reputation: 145755

Missing observations are omitted by default. If a row has an NA for any of the variables used in the model, it will be omitted. See ?lm and the na.action section for details.

You can run na.omit(data[c("outcome", "employee", ..., "color")]) to get the data frame with the incomplete rows omitted (pass every column that appears in your formula). You can also pull it out of the model object: model1$model is the data frame used for model fitting (with missing values omitted).
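A minimal sketch of both approaches on toy data. The data frame `df` and its columns `x` and `y` are made up for illustration and stand in for the asker's `data`. It also shows `na.action()`, which reports the rows that `lm` dropped. One plausible reason the asker's own `na.omit` call returned fewer rows than the model used is that it checked every column of `data`, not just the ones in the formula.

```r
# Toy data standing in for the asker's `data` (hypothetical values)
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10))
df[c(3, 5), "x"] <- NA
df[7, "y"] <- NA

fit <- lm(y ~ x, data = df)

# Complete cases, restricted to the columns that appear in the formula
by_hand <- na.omit(df[c("y", "x")])

# The model frame stored on the fitted object
from_fit <- fit$model

# Rows dropped during fitting, as recorded on the fitted object
dropped <- na.action(fit)

# Both routes keep the same rows of the original data
identical(rownames(by_hand), rownames(from_fit))

# Row names of the omitted observations
names(dropped)
```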

You may also want to look into the broom package for tidying up your model. broom::augment is a nice way to add predictions back to the original data.
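If you would rather stay in base R, one option (not the broom approach, just a sketch on made-up data) is to fit with `na.action = na.exclude`. Incomplete rows are still left out of the fit, but `predict()` and `residuals()` are padded with NA so they line up with the original rows.

```r
# Toy data (hypothetical values) with a few missing entries
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10))
df[c(3, 5), "x"] <- NA
df[7, "y"] <- NA

# na.exclude drops incomplete rows for fitting, like the default na.omit,
# but pads predictions and residuals with NA for the dropped rows
fit <- lm(y ~ x, data = df, na.action = na.exclude)

# One prediction per original row; NA where the row was not used
df$pred <- predict(fit)
df
```

Because the prediction vector has one entry per original row, it can be assigned straight into the data frame without the "differing number of rows" error.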

Upvotes: 2

Cameron Bieganek

Reputation: 7654

Try model.frame:

set.seed(1)

df <- data.frame(x = rnorm(10), y = rnorm(10))
df[c(3, 5), 1] <- NA
df[7, 2] <- NA

df
#             x           y
# 1  -0.6264538  1.51178117
# 2   0.1836433  0.38984324
# 3          NA -0.62124058
# 4   1.5952808 -2.21469989
# 5          NA  1.12493092
# 6  -0.8204684 -0.04493361
# 7   0.4874291          NA
# 8   0.7383247  0.94383621
# 9   0.5757814  0.82122120
# 10 -0.3053884  0.59390132

fit <- lm(y ~ x, df)

model.frame(fit)
#              y          x
# 1   1.51178117 -0.6264538
# 2   0.38984324  0.1836433
# 4  -2.21469989  1.5952808
# 6  -0.04493361 -0.8204684
# 8   0.94383621  0.7383247
# 9   0.82122120  0.5757814
# 10  0.59390132 -0.3053884
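The row names of the model frame are the row names of the original data, so they can be used to pull matching rows for the call that failed in the question. A small sketch using the same toy data as above (the name `used` is just for illustration):

```r
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10))
df[c(3, 5), 1] <- NA
df[7, 2] <- NA

fit <- lm(y ~ x, df)

# Row names of the observations that were actually used in the fit
used <- rownames(model.frame(fit))

# Index the original data by those row names so both columns have equal length
test <- data.frame(y = predict(fit), x = df[used, "x"])
nrow(test)
```

This indexes by row name rather than position, so it relies on the row names of the original data frame being left unchanged between fitting and lookup.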

Upvotes: 7
