Adding extra variables to summarize_at in the R dplyr package

Question

I am using R's dplyr package to obtain summary statistics for several variables, many of which are all similarly named. For example, I have several hundred variables that are all named doctor_id_X, where X is some integer (e.g. doctor_id_1, doctor_id_33, doctor_id_543, etc.). I am able to easily obtain the maximum value of these "doctor_id" variables, grouped by treatment1, using so-called "helpers" inside a summarize_at statement like this:

mydf %>% group_by(treatment1) %>% 
summarize_at(vars(contains("doctor_id")), max)

However, in addition to obtaining these max statistics, I'm trying to obtain the mean of some other variable that is not named similar to one of the doctor_id variables, call it procedure_time. Is there a way for me to do this efficiently by modifying my code above?

To better explain, what I'm trying to do, here's a toy example of a dataframe:

library(dplyr)
set.seed(20190813)
mydf <- data.frame(treatment1 = sample(LETTERS[1:3], 100, replace=TRUE),
                   treatment2 = sample(LETTERS[4:5], 100, replace=TRUE), 
                   doctor_id_1=rbinom(100, 1, .01),
                   doctor_id_2=rbinom(100, 1, .5),
                   doctor_id_3=rbinom(100, 1, .6),
                   doctor_id_4=rbinom(100, 1, .7),
                   y=rnorm(100))
    head(mydf)

  treatment1 treatment2 doctor_id_1 doctor_id_2 doctor_id_3 doctor_id_4          y
1          A          D           0           0           1           0 -1.1175362
2          A          E           0           0           1           1 -2.2813598
3          A          D           0           1           1           0  0.5886914
4          A          D           0           0           1           1  0.9854405
5          B          E           0           0           1           1  1.8831306
6          A          E           0           1           1           1 -0.3875261

It's no problem getting the max value summaries by the doctor_id variables with this:

mydf %>% group_by(treatment1) %>% 
   summarize_at(vars(contains("doctor_id")), max)

# A tibble: 3 x 5
  treatment1 doctor_id_1 doctor_id_2 doctor_id_3 doctor_id_4
                                   
1 A                    1           1           1           1
2 B                    0           1           1           1
3 C                    0           1           1           1

But now, I want to also find the mean of y in my summarize statement. I tried this but it did not work:

mydf %>% group_by(treatment1) %>% 
   summarize_at(y_avg=mean(y), vars(y, contains("doctor_id")), max)

I did note, however, that the following gets me a little closer to what I want, only it's producing the max for all the doctor_id variable and the y variable, but I need just the mean for the y variable.

mydf %>% group_by(treatment1) %>% 
summarize_at(vars(y, contains("doctor_id")), max)

# A tibble: 3 x 6
  treatment1     y doctor_id_1 doctor_id_2 doctor_id_3 doctor_id_4
                                    
1 A           1.70           1           1           1           1
2 B           1.88           0           1           1           1
3 C           1.45           0           1           1           1

This also gets me close, but it produces both statistics for both variables:

mydf %>% group_by(treatment1) %>% 
   summarize_at(vars(y, contains("doctor_id")), c(max, mean))

So, to summarize, my question is, is there any way to use the helper function in a summarize_by statement and also include a different statistic for another variable, without having to resort to joins?

akrun · Accepted Answer

Here is one option where we create a variable 'y_avg' either with mutate (and then add that also in the group_by) or directly in group_by after the 'treatment' grouped data, then do the summarise_at

library(dplyr)
mydf %>%
   group_by(treatment1)  %>%
   group_by(y_avg = mean(y), add = TRUE ) %>% 
   summarize_at(vars(contains("doctor_id")), max)
# A tibble: 3 x 6
# Groups:   treatment1 [3]
#  treatment1    y_avg doctor_id_1 doctor_id_2 doctor_id_3 doctor_id_4
#                                       
#1 A          -0.216             0           1           1           1
#2 B           0.0659            0           1           1           1
#3 C          -0.00830           1           1           1           1

or another option is to create the the 'y_avg' in mutate and then use it in group_by

mydf %>%
   group_by(treatment1) %>%
   mutate(y_avg = mean(y) %>%
   group_by(y_avg, add = TRUE) %>%
   summarize_at(vars(contains("doctor_id")), max)

Or another option is to select only the variable of interest, after grouping by 'treatment' use mutate and mutate_at to update the columns, then the distinct rows of the dataset

mydf %>%
   select(treatment1, y_avg = y, contains('doctor_id')) %>% 
   group_by(treatment1) %>% 
   mutate(y_avg = mean(y_avg)) %>% 
   mutate_at(vars(contains('doctor_id')), max) %>%
   distinct

Or an option with data.table

library(data.table)
setDT(mydf)[, c(.(y_avg = mean(y)), lapply(.SD, max)),
        .(treatment1), .SDcols = grep('doctor_id', names(mydf))]
#.  treatment1        y_avg doctor_id_1 doctor_id_2 doctor_id_3 doctor_id_4
#1:          C -0.008299684           1           1           1           1
#2:          B  0.065875911           0           1           1           1
#3:          A -0.216200359           0           1           1           1

Adding extra variables to summarize_at in the R dplyr package

Answers (2)

Related Questions