deanpwr

Reputation: 191

Dplyr multiple piped dynamic variables?

I do this a lot:

library(tidyverse)

iris %>% 
  group_by(Species) %>% 
  summarise(num_Species = n_distinct(Species)) %>% 
  mutate(perc_Species = 100 * num_Species / sum(num_Species))

So I would like to create a function that outputs the same thing but with dynamically named num_ and perc_ columns:

num_perc <- function(df, group_var, summary_var) {
  
}

I found this resource useful but it did not directly address how to reuse newly created column names in the way I want.

Upvotes: 2

Views: 223

Answers (3)

PaulS

Reputation: 25323

Another possible solution, which uses deparse(substitute(...)) to capture the names of the arguments passed to the function as strings:

library(tidyverse)

f <- function(df, group_var, summary_var)
{
  group_var <- deparse(substitute(group_var))
  summary_var <- deparse(substitute(summary_var))

  df %>% 
    group_by(!!sym(group_var)) %>% 
    summarise(!!str_c("num_", summary_var) := n_distinct(!!sym(summary_var))) %>% 
    mutate(!!str_c("per_", summary_var) := 100 * !!sym(str_c("num_", summary_var)) / sum(!!sym(str_c("num_", summary_var))))
}

f(iris, Species, Species)

#> # A tibble: 3 × 3
#>   Species    num_Species per_Species
#>   <fct>            <int>       <dbl>
#> 1 setosa               1        33.3
#> 2 versicolor           1        33.3
#> 3 virginica            1        33.3
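
To see what deparse(substitute(...)) captures on its own, here is a minimal sketch. arg_name is a hypothetical helper introduced only for illustration:

```r
# arg_name() is a hypothetical helper, not part of base R or the tidyverse.
# substitute() returns the unevaluated expression supplied for x,
# and deparse() turns that expression into a character string.
arg_name <- function(x) deparse(substitute(x))

# Species is never evaluated, so it does not need to exist as an object.
captured <- arg_name(Species)
captured
#> [1] "Species"
```

That string is what sym() turns back into a column reference inside the pipeline.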

Upvotes: 2

jpenzer

Reputation: 919

Are you sure n_distinct() is what you want? The iris dataset has three species (setosa, versicolor, virginica), so within each group n_distinct(Species) is 1 and each species comes out as 1/3 of the distinct species. iris happens to be balanced, with 50 rows per species, so each species also makes up 1/3 of the rows, but in general this will not be the case.

A function with data masking will cover imbalanced datasets for you:

library(dplyr)
my_func <- function(df, var, percent){
  df %>%
    count({{var}}) %>%
    # "{{ percent }}" names the new column after the argument passed as percent
    mutate("{{ percent }}" := 100 * n / sum(n))
}

my_func(iris, Species, percent)

iris %>%
  my_func(Species, percent) # or with the pipe
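
To make the point about imbalanced data concrete, here is a small sketch on a made-up tibble (toy and grp are hypothetical names, not from the question). It compares row shares from count() with shares built on n_distinct():

```r
library(dplyr)

# Hypothetical imbalanced data: three rows of "a", one row of "b".
toy <- tibble(grp = c("a", "a", "a", "b"))

# Share of rows per group.
by_rows <- toy %>%
  count(grp) %>%
  mutate(percent = 100 * n / sum(n))

# Share based on n_distinct(): each group holds exactly one distinct grp
# value, so every group gets the same percentage regardless of its size.
by_distinct <- toy %>%
  group_by(grp) %>%
  summarise(num_grp = n_distinct(grp)) %>%
  mutate(perc_grp = 100 * num_grp / sum(num_grp))

by_rows$percent       # 75 25
by_distinct$perc_grp  # 50 50
```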

Upvotes: 1

caldwellst

Reputation: 5956

You can use as_label(enquo()) on group_var to extract the name of the variable passed in as a string, and then use that string to build the new column names. There is a clear example of this in section 6.1.3 of the document you linked. That way we can dynamically prepend num_ and perc_ to the variable name, and only df and group_var need to be passed in.

library(dplyr)

num_perc <- function(df, group_var) {
  summary_lbl <- as_label(enquo(group_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)
  
  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ group_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

num_perc(iris, Species)
#> # A tibble: 3 × 3
#>   Species    num_Species perc_Species
#>   <fct>            <int>        <dbl>
#> 1 setosa               1         33.3
#> 2 versicolor           1         33.3
#> 3 virginica            1         33.3
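
For reference, this is roughly what as_label(enquo()) gives you in isolation. show_label is a hypothetical helper used only for this sketch:

```r
library(rlang)

# show_label() is a hypothetical helper, not an rlang function.
# enquo() captures the argument as a quosure without evaluating it;
# as_label() converts that quosure to a string.
show_label <- function(x) as_label(enquo(x))

lbl <- show_label(Species)
paste0("num_", lbl)
#> [1] "num_Species"
```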

In the case where group_var and summary_var actually differ, the solution is essentially the same:

num_perc <- function(df, group_var, summary_var) {
  summary_lbl <- as_label(enquo(summary_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)
  
  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ summary_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

num_perc(iris, Species, Species)
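
As a quick check with columns that really do differ, here is a sketch on the built-in mtcars data. The choice of cyl and gear is an assumption made purely for illustration, and the function is repeated so the block runs on its own:

```r
library(dplyr)

num_perc <- function(df, group_var, summary_var) {
  # Build the new column names from the name of summary_var.
  summary_lbl <- as_label(enquo(summary_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)

  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ summary_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

# Number of distinct gear values within each cyl group.
res <- num_perc(mtcars, cyl, gear)

names(res)     # "cyl" "num_gear" "perc_gear"
res$num_gear   # 3 3 2
res$perc_gear  # 37.5 37.5 25.0
```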

Upvotes: 6
