C. Rea
C. Rea

Reputation: 131

retreiving tidy results from regression by group with broom

The answer to this question clearly explains how to retrieve tidy regression results by group when running a regression through a dplyr pipe, but the solution is no longer reproducible.

How can one use dplyr and broom in combination to run a regression by group and retrieve tidy results using R 4.02, dplyr 1.0.0, and broom 0.7.0?

Specifically, the example answer from the question linked above,

library(dplyr)
library(broom)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

returns the following error (and three warnings) when I run it on my system:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA

If I reformat df.h$hour as a character rather than factor,

df.h <- df.h %>%
  mutate(
    hour = as.character(hour)
  )

re-run the regression by group, and again attempt to retrieve the results using broom::tidy,

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

I get this error:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  is.atomic(x) is not TRUE

I assume that the problem has to do with the fact that the group-level regression results are stored as lists in dfHour$fitHour, but I am unsure how to correct the error and once again tidily and quickly compile the regression results, as used to work in the originally posted code/answer.

Upvotes: 12

Views: 2947

Answers (3)

glenninboston
glenninboston

Reputation: 1045

****** Updated with more succinct code pulled from the dplyr 1.0.0 release notes ******

Thank you. I was struggling with a similar question with the update to dplyr 1.0.0 related to using the examples in the provided link. This was both a helpful question and answer.

One note as an FYI, do() has been superseded as of dplyr 1.0.0, so may consider using the updated language (now very efficient with my update):

dfHour = df.h %>% 
  # replace group_by() with nest_by() 
  # to convert your model data to a vector of lists
  nest_by(hour) %>%
  # change do() to mutate(), then add list() before your model
  # make sure to change data = .  to data = data
  mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
  # updated per C.Rea's comment, thanks
  reframe(tidy(fitHour))

Done!

This gives a very efficient df with select output stats. The last line replaces the following code (from my original response), which does the same thing, but less easily:

ungroup() %>%
  # then leverage the feedback from @akrun
  transmute(hour, HourCoef = map(fitHour, tidy)) %>%
  unnest(HourCoef)

dfHour

Which gives the outupt:

# A tibble: 72 x 6
   hour  term         estimate std.error statistic  p.value
   <fct> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 1     (Intercept) 68.6        21.0       3.27   0.00428 
 2 1     wind         0.000558    0.0124    0.0450 0.965   
 3 1     temp        -0.866       0.907    -0.954  0.353   
 4 2     (Intercept) 31.9        17.4       1.83   0.0832  
 5 2     wind         0.00950     0.0113    0.838  0.413   
 6 2     temp         1.69        0.802     2.11   0.0490  
 7 3     (Intercept) 85.5        22.3       3.83   0.00122 
 8 3     wind        -0.0210      0.0165   -1.27   0.220   
 9 3     temp         0.276       1.14      0.243  0.811   
10 4     (Intercept) 73.3        15.1       4.86   0.000126
# ... with 62 more rows

Thanks for the patience, I am working through this myself!

Upvotes: 9

akrun
akrun

Reputation: 887851

Issue would be that there is a grouping attribute rowwise after the do call and the column 'fitHour' is a list. We can ungroup, loop over the list with map and tidy it to a list column

library(dplyr)
library(purrr)
library(broom)
df.h %>% 
     group_by(hour) %>%
     do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
     ungroup %>% 
     mutate(HourCoef = map(fitHour, tidy))

Or use unnest after the mtuate

df.h %>% 
      group_by(hour) %>%
      do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
      ungroup %>% 
      transmute(hour, HourCoef = map(fitHour, tidy)) %>% 
      unnest(HourCoef)
# A tibble: 72 x 6
#   hour  term        estimate std.error statistic  p.value
#   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 1     (Intercept) 89.8       20.2       4.45   0.000308
# 2 1     wind         0.00493    0.0151    0.326  0.748   
# 3 1     temp        -1.84       1.08     -1.71   0.105   
# 4 2     (Intercept) 75.6       23.7       3.20   0.00500 
# 5 2     wind        -0.00910    0.0146   -0.622  0.542   
# 6 2     temp         0.192      0.853     0.225  0.824   
# 7 3     (Intercept) 44.0       23.9       1.84   0.0822  
# 8 3     wind        -0.00158    0.0166   -0.0953 0.925   
# 9 3     temp         0.622      1.19      0.520  0.609   
#10 4     (Intercept) 57.8       18.9       3.06   0.00676 
# … with 62 more rows

If we wanted a single dataset, pull the 'fitHour', loop over the list with map, condense it to a single dataset by row binding (suffix _dfr)

df.h %>%
    group_by(hour) %>%  
    do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
    ungroup %>% 
    pull(fitHour) %>% 
    map_dfr(tidy, .id = 'grp')

NOTE: The OP's error message was able to be replicated with R 4.02, dplyr 1.0.0 and broom 0.7.0

tidy(dfHour,fitHour)

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 2: In mean.default(X[[i]], ...) :

Upvotes: 5

Duck
Duck

Reputation: 39613

Your code actually works. Maybe package version or re starting a new R session could help:

library(dplyr)
library(broom)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

tidy(dfHour,fitHour)

# A tibble: 72 x 6
# Groups:   hour [24]
   hour  term         estimate std.error statistic   p.value
   <fct> <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 1     (Intercept) 66.4       14.8        4.48   0.000288 
 2 1     wind         0.000474   0.00984    0.0482 0.962    
 3 1     temp         0.0691     0.945      0.0731 0.943    
 4 2     (Intercept) 66.5       20.4        3.26   0.00432  
 5 2     wind        -0.00540    0.0127    -0.426  0.675    
 6 2     temp        -0.306      0.944     -0.324  0.750    
 7 3     (Intercept) 86.5       17.3        5.00   0.0000936
 8 3     wind        -0.0119     0.00960   -1.24   0.232    
 9 3     temp        -1.18       0.928     -1.27   0.221    
10 4     (Intercept) 59.8       17.5        3.42   0.00304  
# ... with 62 more rows

Upvotes: 0

Related Questions