Reputation: 2151
I need to group and summarise a dataframe, using different summarise functions depending on the variable I'm summarising. Those functions can have different main and optional arguments and I'd like to code a function which can do all of that.
Here are the simpler functions I've managed to code, just to show the logic of it.
require(tidyverse)
require(magrittr)
require(rlang)
example <- data.frame(y = as.factor(c('A','B','C','A','B')),
                      x1 = c(7, 10, NA, NA, 2),
                      x2 = c(13, 0, 0, 2, 1),
                      z = c(0, 1, 0, 1, 0))
# Summarise variables with common prefix
do_summary_prefix <- function(dataset, y, prefix, fun, ...){
  y <- enquo(y)
  prefix <- quo_name(enquo(prefix))
  fun <- match.fun(fun)
  dataset %<>%
    group_by(!!y) %>%
    # pass the function directly; funs() is deprecated since dplyr 0.8.0
    summarise_at(vars(starts_with(prefix)), fun, ...) %>%
    ungroup()
  return(dataset)
}
do_summary_prefix(example, y, x, 'quantile', probs = 0.25, na.rm = TRUE)
# Summarise variables with different names, one at a time
do_summary_x <- function(dataset, y, x, fun, ...){
  y <- enquo(y)
  x <- enquo(x)
  dataset %<>%
    group_by(!!y) %>%
    summarise(!!paste(quo_name(x), fun, sep = '_') := do.call(match.fun(fun), list(x = !!x, ...))) %>%
    ungroup()
  return(dataset)
}
do_summary_x(example, y, x1, fun = 'mean', na.rm = FALSE)
This is OK for me, and I could use do_summary_x in a sort of loop over the variables I want to summarise to get the job done. But I'd like to integrate the loop into a higher-level function, making use of ... while still being able to use varying parameters for my summarising functions.
I know I can't use ... for several different lower-level functions, so I'll pass one of the two (either my variables or the function parameters) as a list and use do.call. It feels more natural to me to keep ... for the input variables and to pass the parameters, always named, as a list. This is what I've come to:
#install.packages('plyr') # if needed
join_all <- plyr::join_all
do_summary <- function(dataset, y, ..., fun, other_args = list(NULL = NULL)){
  y_quo <- enquo(y)
  y_name <- quo_name(y_quo)
  values <- quos(...)
  # summarise each variable separately, then join the results on the grouping column
  datasets <- lapply(values, function(value){
    summarised_data <- dataset %>%
      group_by(!!y_quo) %>%
      summarise(calcul = do.call(fun,
                                 unlist(list(list(x = !!value),
                                             other_args),
                                        recursive = F))) %>%
      ungroup() %>%
      rename(!!paste(quo_name(value), fun, sep = '_') := calcul)
    return(summarised_data)
  })
  finished <- join_all(datasets, by = y_name, type = 'left')
  return(finished)
}
do_summary(example, y,
           x1, x2, z,
           stat = 'quantile',
           other_args = list(probs = 0.1, na.rm = T))
do_summary(example, y,
           x1, x2, z,
           fun = 'mean')
This is working fine, so I'm happy with it overall, but it only works with functions whose first argument is named x.
Suppose I also want to be able to change the name of the first argument of fun, namely x here. How can I do that?
I haven't found a way to quote and then inject into the do.call something like changing_arg = !!x, or to make sensible use of list(!!changing_arg := !!x).
Upvotes: 4
Views: 269
Reputation: 18681
Here is how I would simplify your function:
library(dplyr)
library(rlang)
do_summary <- function(dataset, y, ..., fun, other_args = list(NULL = NULL)){
  y_quo <- enquo(y)
  values <- quos(...)
  datasets <- dataset %>%
    group_by(!!y_quo) %>%
    summarise_at(vars(!!!values), .funs = fun, !!!other_args) %>%
    rename_at(vars(!!!values), paste, fun, sep = "_")
  return(datasets)
}
do_summary(example, y,
           x1, x2, z,
           fun = 'quantile',
           other_args = list(probs = 0.1, na.rm = T))
do_summary(example, y,
           x1, x2, z,
           fun = 'mean')
Result:
# A tibble: 3 x 4
       y x1_quantile x2_quantile z_quantile
  <fctr>       <dbl>       <dbl>      <dbl>
1      A         7.0         3.1        0.1
2      B         2.8         0.1        0.1
3      C          NA         0.0        0.0
# A tibble: 3 x 4
       y x1_mean x2_mean z_mean
  <fctr>   <dbl>   <dbl>  <dbl>
1      A      NA     7.5    0.5
2      B       6     0.5    0.5
3      C      NA     0.0    0.0
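To make the !!!other_args part of the function above a bit more concrete, here is a small sketch of what the splice does on its own. The vector x and the names spliced_call, via_splice and via_do_call are made up for the demonstration; it only assumes rlang's expr() and !!!.

```r
library(rlang)

other_args <- list(probs = 0.1, na.rm = TRUE)

# expr() captures the call without running it; !!! unpacks the list
# so that each element becomes its own named argument.
spliced_call <- expr(quantile(x, !!!other_args))
print(spliced_call)
#> quantile(x, probs = 0.1, na.rm = TRUE)

# Evaluating the spliced call should give the same value as the
# do.call() approach used in the question.
x <- c(7, NA, 10, 2)   # made-up values for the demonstration
via_splice  <- eval(spliced_call)
via_do_call <- do.call(quantile, c(list(x), other_args))
identical(via_splice, via_do_call)
```

So splicing is essentially do.call's argument-list expansion, just done while the call is being built.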
Notes:
Instead of using lapply to loop over every element of values, you can simply use summarise_at and rename_at and supply values to vars by explicitly splicing it with !!!.
fun is then supplied to the .funs argument of summarise_at, and again you can explicitly splice other_args with !!!. For example, list(probs = 0.1, na.rm = T) turns into probs = 0.1, na.rm = T.
The same idea applies to rename_at: use vars and explicitly splice values. An alternative would be to write rename_at(vars(-!!y_quo), ...), since summarise_at returns only the grouping columns and the summary columns.
This method gets rid of lapply, the awkward do.call in summarise, and the join_all at the end (y_name is thus no longer needed either).
Your do_summary call at the end for quantile seems to contain a typo: instead of stat = "quantile", I think you meant fun = "quantile".
Note that this function only works if you supply the function name as a string, because fun is also pasted into the new column names.
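As for changing the name of the first argument: one possible approach, sketched here under some assumptions, is to build the argument list by hand with setNames() and pass it to do.call inside a small wrapper. The first_arg parameter, the do_summary_named name and the value_range function are all hypothetical and only there for illustration; rlang's list2() with := should serve the same purpose as setNames(), but I have not used it here.

```r
library(dplyr)
library(rlang)

example <- data.frame(y  = as.factor(c('A','B','C','A','B')),
                      x1 = c(7, 10, NA, NA, 2),
                      x2 = c(13, 0, 0, 2, 1),
                      z  = c(0, 1, 0, 1, 0))

# Hypothetical summary function whose first argument is not called x
value_range <- function(values, na.rm = FALSE) {
  diff(range(values, na.rm = na.rm))
}

# Hypothetical variant of do_summary with an extra first_arg parameter
do_summary_named <- function(dataset, y, ..., fun, first_arg = "x",
                             other_args = list()) {
  y_quo  <- enquo(y)
  values <- quos(...)
  # setNames() gives the column the name stored in first_arg,
  # then do.call() matches it to the argument of that name in fun
  call_fun <- function(col) {
    do.call(fun, c(setNames(list(col), first_arg), other_args))
  }
  dataset %>%
    group_by(!!y_quo) %>%
    summarise_at(vars(!!!values), call_fun) %>%
    rename_at(vars(!!!values), paste, fun, sep = "_")
}

res <- do_summary_named(example, y, x2, z,
                        fun = "value_range",
                        first_arg = "values",
                        other_args = list(na.rm = TRUE))
res
```

With first_arg left at its default "x", the same wrapper should behave like the do_summary above for functions such as mean or quantile.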
Upvotes: 2