Romain

Reputation: 2151

How do I combine varying input variables and varying functions in dplyr summarise

I need to group and summarise a dataframe, using different summarise functions depending on the variable I'm summarising. Those functions can have different main and optional arguments and I'd like to code a function which can do all of that.

Here are the simpler functions I've managed to code, just to show the logic of it.

require(tidyverse)
require(magrittr)
require(rlang)

example <- data.frame(y = as.factor(c('A','B','C','A','B')),
                      x1 = c(7, 10, NA, NA, 2),
                      x2 = c(13, 0, 0, 2, 1),
                      z = c(0, 1, 0, 1, 0))

# Summarise variables with common prefix
do_summary_prefix <- function(dataset, y, prefix, fun, ...){
    y <- enquo(y)
    prefix <- quo_name(enquo(prefix))
    fun <- match.fun(fun)
    dataset  %<>%  
       group_by(!!y) %>% 
       summarise_at(vars(starts_with(prefix)), funs(fun), ...) %>% 
       ungroup()
    return(dataset)
}
do_summary_prefix(example, y, x, 'quantile', probs = 0.25, na.rm = T) 

# Summarise variables with different names, one at a time
do_summary_x <- function(dataset, y, x, fun, ...){
    y <- enquo(y)
    x <- enquo(x)

    dataset  %<>%  
       group_by(!!y) %>% 
       summarise(!!paste(quo_name(x), fun, sep = '_') := do.call(match.fun(fun), list(x = !!x, ...))) %>% 
       ungroup()
    return(dataset)
}
do_summary_x(example, y, x1, fun = 'mean', na.rm = F)

This is ok for me, and I could use do_summary_x in a sort of loop over the variables I want to summarise to get the job done. But I'd like to integrate the loop in a higher level function, making use of ... while still being able to use varying parameters for my summarising functions.

I know I can't use ... for several different sublevel functions, so I'll pass one of the two (either my variables or the function parameters) as a list, and use do.call. It feels more natural to me to keep ... for the input variables and to pass the extra parameters, always named, as a list. This is what I've come up with:

#install.packages('plyr') # if needed
join_all <- plyr::join_all

do_summary <- function(dataset, y, ..., fun, other_args = list(NULL = NULL)){
    y_quo <- enquo(y)
    y_name <- quo_name(y_quo)

    values <- quos(...)

    datasets <- lapply(values, function(value){
      summarised_data <- dataset %>% 
      group_by(!!y_quo) %>% 
      summarise(calcul = do.call(fun, 
                                 unlist(list(list(x = !!value),
                                             other_args),
                                        recursive = F))) %>%
      ungroup() %>%
      rename(!!paste(quo_name(value), fun, sep = '_') := calcul)
    return(summarised_data)
  })
  finished <- join_all(datasets, by = y_name, type = 'left')
  return(finished)
}
do_summary(example, y,
           x1, x2, z,
           stat = 'quantile',
           other_args = list(probs = 0.1, na.rm = T))
do_summary(example, y,
           x1, x2, z,
           fun = 'mean')

This works fine, so I'm happy with it overall, but it only works with functions whose first argument is named x.

Suppose I also want to be able to change the name of the first argument of fun, which is x here. How do I do that?

I haven't found a way to quote and then inject into the do.call something like changing_arg = !!x, or to make sensible use of list(!!changing_arg := !!x).

Upvotes: 4

Views: 269

Answers (1)

acylam

Reputation: 18681

Here is how I would simplify your function:

library(dplyr)
library(rlang)

do_summary <- function(dataset, y, ..., fun, other_args = list(NULL = NULL)){

  y_quo <- enquo(y)
  values <- quos(...)

  datasets <- dataset %>% 
      group_by(!!y_quo) %>% 
      summarise_at(vars(!!!values), .funs = fun, !!!other_args) %>%
      rename_at(vars(!!!values), paste, fun, sep = "_")

  return(datasets)
}

do_summary(example, y,
           x1, x2, z,
           fun = 'quantile',
           other_args = list(probs = 0.1, na.rm = T))

do_summary(example, y,
           x1, x2, z,
           fun = 'mean')

Result:

# A tibble: 3 x 4
       y x1_quantile x2_quantile z_quantile
  <fctr>       <dbl>       <dbl>      <dbl>
1      A         7.0         3.1        0.1
2      B         2.8         0.1        0.1
3      C          NA         0.0        0.0

# A tibble: 3 x 4
       y x1_mean x2_mean z_mean
  <fctr>   <dbl>   <dbl>  <dbl>
1      A      NA     7.5    0.5
2      B       6     0.5    0.5
3      C      NA     0.0    0.0

Notes:

  1. Instead of using lapply to loop over each element of values, you can simply use summarise_at and rename_at, and supply values to vars by explicit splicing with !!!.

  2. fun is then supplied to the .funs argument of summarise_at and, again, you can explicitly splice other_args with !!!. For example, list(probs = 0.1, na.rm = T) turns into probs = 0.1, na.rm = T.

  3. Same idea for rename_at. Use vars and explicitly splice values. An alternative would be to write rename_at(vars(-y_name), ...) since summarise_at returns only grouping columns and summary columns.

  4. This method gets rid of lapply, the awkward do.call in summarise and the join_all at the end (so y_name is no longer needed either).

  5. Your do_summary call for quantile at the end seems to contain a typo: instead of stat = "quantile", I think you meant fun = "quantile".

  6. Note that this function only works if you supply the function name in the form of a string.
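
The function above does not cover the last point of the question: calling a summary function whose first argument is not named x. One possible approach, sketched here in base R only, is to build the argument list with a dynamic name and hand it to do.call. The helper call_with_first_arg and the example function spread are hypothetical names made up for this illustration, not part of dplyr or rlang.

```r
# Hypothetical helper (not part of dplyr or rlang): call `fun` with `value`
# passed under the argument name given in `first_arg`, plus extra arguments.
call_with_first_arg <- function(fun, first_arg, value, other_args = list()) {
  # Build list(<first_arg> = value), then append the optional named arguments
  args <- c(setNames(list(value), first_arg), other_args)
  do.call(fun, args)
}

# Illustrative function whose first argument is named `v`, not `x`
spread <- function(v, na.rm = FALSE) diff(range(v, na.rm = na.rm))

# `fun` can be a function object or a function name given as a string
call_with_first_arg(spread, "v", c(1, 5, NA), list(na.rm = TRUE))
call_with_first_arg("quantile", "x", c(13, 2), list(probs = 0.1))
```

Inside summarise, the value could presumably be injected as !!value, as in the question's do_summary. With rlang, list2(!!first_arg := value, !!!other_args) should build the same argument list, which is close to the list(!!changing_arg := !!x) form the question was looking for.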

Upvotes: 2
