Duck
Duck

Reputation: 39585

Make predictions after merging estimated models in a dataframe for all observations using broom and dplyr

I am working in a modeling problem where I have to estimate multiple models per group using specific variables. After having all models per groups I need to compute the estimated values (fitted values) and standard errors (se) for all of them. I found we can use broom and dplyr for this. So, I sketched next code using iris data:

library(dplyr)
library(broom)
#Data
data("iris")
#Code
iris2 <- iris %>% group_by(Species)

Data iris2 has a group based on Species. With this I compute the different models using next code:

#Models
models <- iris2 %>% 
  do(
  model1 = lm(Sepal.Length~Sepal.Width, data = .),
  model2 = lm(Sepal.Width~Petal.Width, data = .),
  model3 = lm(Petal.Width~Sepal.Length+Sepal.Width, data = .),
  model4 = lm(Petal.Width~Petal.Length+Sepal.Length, data = .))

Which produces:

models
# A tibble: 3 x 5
# Rowwise: 
  Species    model1 model2 model3 model4
  <fct>      <list> <list> <list> <list>
1 setosa     <lm>   <lm>   <lm>   <lm>  
2 versicolor <lm>   <lm>   <lm>   <lm>  
3 virginica  <lm>   <lm>   <lm>   <lm>  

All is fine. Now, I need to compute the predictions for the entire dataset iris2 using the four models. I used a merge approach to add the models to the iris2 dataframe:

#Merge with original data
Merged <- iris2 %>%
  left_join(models)

At this point, the data has all the models but I am not sure of how to proceed to compute the fitted values and standard error. In the end I would like to have two additional columns per each model like predict.mod1 and se.mod1 until predict.mod4 and se.mod4 for all the observations in iris2.

This is my main issue as I do not know how to use predict() to take each model1, model2, model3 and model4 to compute both fitted and se values. I know I can use predict() and predict(model,data,se=T)$$se.fit to obtain the values in an isolated dataframe with one model but in this case I have multiple models after the merge action.

I have checked some posts around the site and they use mutate(result = map2(fit, data, predict)) to create the estimations. I tried a similar approach but did not work.

Many thanks for your help.

Upvotes: 1

Views: 298

Answers (1)

akrun
akrun

Reputation: 886938

Instead of separating the dataset and joining, could create the models in the same data after nesting

library(dplyr)
library(broom)
iris1 <- iris %>% 
    nest_by(Species) %>% 
    mutate(model1 = list(lm(Sepal.Length ~ Sepal.Width, data = data)),
           model2 = list(lm(Sepal.Width~Petal.Width, data = data)), 
           model3 = list(lm(Petal.Width~Sepal.Length+Sepal.Width, data = data)),
           model4 = list(lm(Petal.Width~Petal.Length+Sepal.Length, data = data)))

-output

iris1
# A tibble: 3 x 6
# Rowwise:  Species
#  Species              data model1 model2 model3 model4
#  <fct>      <list<tibble>> <list> <list> <list> <list>
#1 setosa           [50 × 4] <lm>   <lm>   <lm>   <lm>  
#2 versicolor       [50 × 4] <lm>   <lm>   <lm>   <lm>  
#3 virginica        [50 × 4] <lm>   <lm>   <lm>   <lm>

Create the columns in a similar pipeline as used by @LMc except that the 'data' can be passed in (which is not really needed here unless there is a different data)

iris1 %>% 
 mutate(across(starts_with('model'), 
    list(se = ~ list(tidy(.)$std.error), 
        predict = ~ list(predict(., data)))))
# A tibble: 3 x 14
# Rowwise:  Species
#  Species              data model1 model2 model3 model4 model1_se model1_predict model2_se model2_predict model3_se model3_predict model4_se
#  <fct>      <list<tibble>> <list> <list> <list> <list> <list>    <list>         <list>    <list>         <list>    <list>         <list>   
#1 setosa           [50 × 4] <lm>   <lm>   <lm>   <lm>   <dbl [2]> <dbl [50]>     <dbl [2]> <dbl [50]>     <dbl [3]> <dbl [50]>     <dbl [3]>
#2 versicolor       [50 × 4] <lm>   <lm>   <lm>   <lm>   <dbl [2]> <dbl [50]>     <dbl [2]> <dbl [50]>     <dbl [3]> <dbl [50]>     <dbl [3]>
#3 virginica        [50 × 4] <lm>   <lm>   <lm>   <lm>   <dbl [2]> <dbl [50]>     <dbl [2]> <dbl [50]>     <dbl [3]> <dbl [50]>     <dbl [3]>
# … with 1 more variable: model4_predict <list>

From the nested data, it can be unested. Here, we are selecting columns that have the same length (model columns are different structure)

library(tidyr)
iris1 %>% 
 mutate(across(starts_with('model'), 
  list(se = ~ list(tidy(.)$std.error), 
    predict = ~ list(predict(., data))))) %>% 
 ungroup %>% 
 select(Species, data, ends_with('predict')) %>%
 unnest(-Species)
# A tibble: 150 x 9
#   Species Sepal.Length Sepal.Width Petal.Length Petal.Width model1_predict model2_predict model3_predict model4_predict
#   <fct>          <dbl>       <dbl>        <dbl>       <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
# 1 setosa           5.1         3.5          1.4         0.2           5.06           3.39          0.254          0.241
# 2 setosa           4.9         3            1.4         0.2           4.71           3.39          0.232          0.229
# 3 setosa           4.7         3.2          1.3         0.2           4.85           3.39          0.221          0.200
# 4 setosa           4.6         3.1          1.5         0.2           4.78           3.39          0.212          0.228
# 5 setosa           5           3.6          1.4         0.2           5.12           3.39          0.248          0.235
# 6 setosa           5.4         3.9          1.7         0.4           5.33           3.56          0.281          0.310
# 7 setosa           4.6         3.4          1.4         0.3           4.99           3.47          0.217          0.211
# 8 setosa           5           3.4          1.5         0.2           4.99           3.39          0.245          0.252
# 9 setosa           4.4         2.9          1.4         0.2           4.64           3.39          0.195          0.199
#10 setosa           4.9         3.1          1.5         0.1           4.78           3.31          0.233          0.246
# … with 140 more rows

Or using the se from predict

iris1 %>% 
  mutate(across(starts_with('model'), 
   list(se = ~ list(predict(., data, se = TRUE)$se.fit), 
         predict = ~ list(predict(., data))))) %>% 
  ungroup %>%
  select(-matches('^model\\d+$')) %>%
  unnest(-Species)

Upvotes: 2

Related Questions