tassones
tassones

Reputation: 1692

How can I apply linear model coefficients to input data for multiple groups?

My input dependent variable has NA values but was used to build linear models for several groups. How can I use the linear model coefficients to derive predicted values for each row among each group?

Example data

library(dplyr)

df <- iris
df[c(1:2,51:52,101:102),3] <- NA # add NA values to dependent variable

# Creates a linear model for each Species group

fitted_models <- df %>%
  group_by(Species) %>%
  do(model = lm(Petal.Length~Sepal.Length+Sepal.Width, data = .))

The below code works so long as there are no NA values in the dependent variables. However, the below code throws the following error message when you try to use it with the data with NA values (i.e., df): Error: Incompatible lengths: 50, 48.

library(tidyr)
library(purrr)

# works
fitted_models <- iris %>%
  nest(data = -Species) %>% 
  mutate(fit = map(data, ~ lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .x)),
         fitted.values = map(fit, "fitted.values")) %>% 
  unnest(cols = c(data, fitted.values)) %>% 
  select(-fit)

# does not work
fitted_models <- df %>%
  nest(data = -Species) %>% 
  mutate(fit = map(data, ~ lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .x)),
         fitted.values = map(fit, "fitted.values")) %>% 
  unnest(cols = c(data, fitted.values)) %>% 
  select(-fit)

The ideal output (see below) would use the coefficients from each of the Species models to derive predicted values for each row, including those with NA in the dependent variable.

   Species Sepal.Length Sepal.Width Petal.Length Petal.Width predicted.values
   <fct>          <dbl>       <dbl>        <dbl>       <dbl>         <dbl>
 1 setosa           5.1         3.5          NA          0.2          1.47
 2 setosa           4.9         3            NA          0.2          1.46
 3 setosa           4.7         3.2          1.3         0.2          1.42
 4 setosa           4.6         3.1          1.5         0.2          1.41
 5 setosa           5           3.6          1.4         0.2          1.46
 6 setosa           5.4         3.9          1.7         0.4          1.51
 7 setosa           4.6         3.4          1.4         0.3          1.40
 8 setosa           5           3.4          1.5         0.2          1.46
 9 setosa           4.4         2.9          1.4         0.2          1.38
10 setosa           4.9         3.1          1.5         0.1          1.45
# ... with 140 more rows

Upvotes: 1

Views: 242

Answers (1)

wibeasley
wibeasley

Reputation: 5287

This version with predict() and purrr::map2() works.

f <- function (.fit, .new_data) {
  predict(.fit, newdata = .new_data)
}

df %>%
  nest(data = -Species) %>% 
  mutate(
    fit  = map(data, ~ lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = .x)),
    yhat = map2(.x = fit, .y = data, f)
  ) %>% 
  unnest(cols = c(data, yhat)) %>% 
  select(-fit)

Output:

# A tibble: 150 × 6
   Species Sepal.Length Sepal.Width Petal.Length Petal.Width  yhat
   <fct>          <dbl>       <dbl>        <dbl>       <dbl> <dbl>
 1 setosa           5.1         3.5         NA           0.2  1.48
 2 setosa           4.9         3           NA           0.2  1.46
 3 setosa           4.7         3.2          1.3         0.2  1.42
 4 setosa           4.6         3.1          1.5         0.2  1.41
 5 setosa           5           3.6          1.4         0.2  1.46
 6 setosa           5.4         3.9          1.7         0.4  1.51
 7 setosa           4.6         3.4          1.4         0.3  1.40
 8 setosa           5           3.4          1.5         0.2  1.46
 9 setosa           4.4         2.9          1.4         0.2  1.39
10 setosa           4.9         3.1          1.5         0.1  1.46
# … with 140 more rows

I wish I could figure it out with the fitted.values variable that's already calculated. I tried adding a row id before nesting, so the two data.frames could be left-joined (because data has all 50 rows). But then I thought the two unnesting and one join operations are more conceptual work than rerunning the linear equation used in predict().

Upvotes: 2

Related Questions