Emman

Reputation: 4201

How to allow optional summary computations in dplyr::summarise() when writing a custom wrapper

When writing a custom wrapper function, what would be a concise way to enable/disable an additional computation within dplyr::summarise()?

For example, consider the following function that takes in data and allows the user to get the mean and sd over a specific column in the data:

library(dplyr)
library(tidyr)

get_means <- function(data, var_to_average) {
  
  data %>%
    pivot_longer(cols = {{ var_to_average }}, values_to = "response") %>%
    group_by(name) %>%
    summarise(mean = mean(response, na.rm = TRUE),
              sd = sd(response, na.rm = TRUE), .groups = "drop")
}

get_means(mtcars, mpg)

# A tibble: 1 x 3
  name   mean    sd
* <chr> <dbl> <dbl>
1 mpg    20.1  6.03

But what if I want to make the computation of sd optional?

One option would be to write terribly repetitive code:

get_means_repetitive <- function(data, var_to_average, get_sd = NULL) {
  
  if (is.null(get_sd)) {
    data %>%
      pivot_longer(cols = {{ var_to_average }}, values_to = "response") %>%
      group_by(name) %>%
      summarise(mean = mean(response, na.rm = TRUE),
                .groups = "drop") 
    
  } else if (get_sd) {
    
    data %>%
      pivot_longer(cols = {{ var_to_average }}, values_to = "response") %>%
      group_by(name) %>%
      summarise(mean = mean(response, na.rm = TRUE),
                sd = sd(response, na.rm = TRUE), .groups = "drop")
  }

}

I want to avoid such code for several reasons. First, it's repetitive and error-prone. Second, ideally I'd like to make other parts of the function "tweakable" (i.e., able to be switched on/off), so I need an easy way to allow any combination of components to be on or off. Relying on if-else blocks is very limiting.


Could there be a more succinct way to achieve this?

Here is one idea, which doesn't work as written (and I'm not even sure this is the right direction):

get_means_succinct <- function(data, var_to_average, get_sd = NULL) {
  
  if (is.null(get_sd)) {
    include_sd <- NULL
  } else {
    include_sd <- sd(response, na.rm = TRUE)
  }
  
  data %>%
    pivot_longer(cols = {{ var_to_average }}, values_to = "response") %>%
    group_by(name) %>%
    summarise(mean = mean(response, na.rm = TRUE),
              sd = include_sd, .groups = "drop")
}

Any ideas?


EDIT


Based on @G. Grothendieck's answer, I'd like to highlight that my question uses sd() just as an example. I'm looking for a general coding solution that is efficient in terms of both code readability and execution speed. I'd like to avoid evaluating optional computations unless they were asked for (in this example, whether to compute the sd).

Upvotes: 2

Views: 73

Answers (1)

G. Grothendieck

Reputation: 270195

If mean and sd are just for purposes of example and in actuality represent a long calculation, use an if to prevent their computation, and then select out the desired columns in the last line.

(If it really were just mean and sd, they are computed so fast that there is likely no point in avoiding their computation. In that case we could omit the if's, compute both even when one is unused, and just use the select at the end to extract the ones desired.)

get_means2 <- function(data, var_to_average, stats = c("mean", "sd")) {   
  data %>%
    pivot_longer(cols = {{ var_to_average }}) %>%
    group_by(name) %>%
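    summarise(
      # each `if` skips the (possibly expensive) computation when not requested
      mean = if ("mean" %in% stats) mean(value, na.rm = TRUE) else NA,
      sd = if ("sd" %in% stats) sd(value, na.rm = TRUE) else NA, .groups = "drop") %>%
    # all_of() selects by character vector and drops the NA placeholder columns
    select(name, all_of(stats))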
}

get_means2(mtcars, mpg) # mean, sd
get_means2(mtcars, mpg, "mean") # only mean
get_means2(mtcars, mpg, "sd") # only sd
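
A possible variation on the same idea, in case the number of optional statistics grows: keep every candidate summary as an unevaluated expression in a named list and splice only the requested ones into summarise() with `!!!`. This is only a sketch. The name `get_means3` and the `available` list are made up for illustration, and it assumes the rlang package is installed (dplyr depends on it). Expressions that are not requested are never evaluated, so no NA placeholder columns or final select are needed.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of get_means2: summaries are stored unevaluated
get_means3 <- function(data, var_to_average, stats = c("mean", "sd")) {
  # Named list of unevaluated expressions; nothing is computed here
  available <- rlang::exprs(
    mean = mean(value, na.rm = TRUE),
    sd   = sd(value, na.rm = TRUE)
  )
  # Validate the requested names against the available ones
  stats <- match.arg(stats, names(available), several.ok = TRUE)

  data %>%
    pivot_longer(cols = {{ var_to_average }}) %>%
    group_by(name) %>%
    # Splice in only the requested expressions; the rest are never run
    summarise(!!!available[stats], .groups = "drop")
}

get_means3(mtcars, mpg)          # mean, sd
get_means3(mtcars, mpg, "mean")  # only mean
```

Adding another optional statistic would then mean adding one entry to `available`, rather than another if/else branch.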

Upvotes: 1
