Steven
Steven

Reputation: 3294

Using dplyr() to retrieve model object created via group_by() and do()

I'm trying to use dplyr and the pipe operator (%>%) to retrieve model objects stored in a dataframe.

With example data

library(dplyr)

set.seed(256)
dat <- 
  data.frame(x = rnorm(100), 
           y = rnorm(100, 10), 
           spec = sample(c("1", "2"), 100, TRUE)) %>%
  group_by(spec) %>%
  do(lm = lm(y ~ x, data = .))

I can subset and retrieve an actual model object

> dat$lm[dat$spec == "1"][[1]]

Call:
lm(formula = y ~ x, data = .)

Coefficients:
(Intercept)            x  
     9.8171      -0.2292  

> dat$lm[dat$spec == "1"][[1]] %>% class()
[1] "lm

But I think this is an inelegant way of retrieving the lm() model object contained therein, especially given that the rest of my code is structured the "dplyr way". I'd like to use dplyr but I can't figure out how. For example, using

dat %>% filter(spec == "1") %>% select(lm) 

doesn't work as it returns

Source: local data frame [1 x 1]
Groups: <by row>

# A tibble: 1 x 1
  lm      
  <list>  
1 <S3: lm>

and

dat %>% filter(spec == "1") %>% .$lm

only gets me to the first object in list, e.g.,

> dat %>% filter(spec == "1") %>% .$lm
[[1]]

Call:
lm(formula = y ~ x, data = .)

Coefficients:
(Intercept)            x  
   10.01495     -0.07438  

I can't figure out a way to get to the actual model object in the dat with dplyr. Certainly, I could use broom and tidy() to condense everything

library(broom)
tidy(dat, lm)

but this still doesn't return the actual model object:

> tidy(dat, lm)
# A tibble: 4 x 6
# Groups: spec [2]
  spec  term        estimate std.error statistic                        p.value
  <fct> <chr>          <dbl>     <dbl>     <dbl>                          <dbl>
1 1     (Intercept)  10.0        0.120    83.3                         1.91e-54
2 1     x           - 0.0744     0.111   - 0.671                       5.05e- 1
3 2     (Intercept)   9.86       0.131    75.0                         1.42e-50
4 2     x           - 0.0793     0.148   - 0.535                       5.95e- 1

I can even use dplyr to summarise() the output from a do() call and retrieve the coefficients from the models, but this still doesn't give me the model object itself:

dat %>% 
  select(spec) %>%
  bind_cols(dat %>%
              summarize(lm_i = coefficients(lm)[[1]], 
                        lm_s = coefficients(lm)[[2]]))

Is there a dplyr way to retrieve the actual model object from models created with do()?

Upvotes: 2

Views: 461

Answers (1)

alistaire
alistaire

Reputation: 43354

do returns a list column, so to extract its individual elements, you need to use list subsetting. There are various ways to do that, but in the tidyverse, purrr::pluck is a nice option to extract a single [possibly deeply nested] element:

library(tidyverse)

dat %>% pluck('lm', 1)
#> 
#> Call:
#> lm(formula = y ~ x, data = .)
#> 
#> Coefficients:
#> (Intercept)            x  
#>    10.01495     -0.07438

It's mostly equivalent to [[ subsetting, i.e.

dat[['lm']][[1]]

To get what you have to work, you need to keep subsetting, as .$lm returns the list column, which in this case is a list of a model. .[[1]] (akin to the 1 above) extracts the model from the list:

dat %>% filter(spec == "1") %>% .$lm %>% .[[1]]

or a hybrid approach, if you like:

dat %>% filter(spec == "1") %>% pluck('lm', 1)

or use pull to extract the column with NSE semantics:

dat %>% filter(spec == "1") %>% pull(lm) %>% pluck(1)

All return the same thing.

Upvotes: 5

Related Questions