Reputation: 191
I do this a lot:
library(tidyverse)
iris %>%
  group_by(Species) %>%
  summarise(num_Species = n_distinct(Species)) %>%
  mutate(perc_Species = 100 * num_Species / sum(num_Species))
So I would like to create a function that outputs the same thing but with dynamically named num_ and perc_ columns:
num_perc <- function(df, group_var, summary_var) {
}
I found this resource useful but it did not directly address how to reuse newly created column names in the way I want.
Upvotes: 2
Views: 223
Reputation: 25323
Another possible solution, which uses deparse(substitute(...)) to get the names of the function arguments as strings:
library(tidyverse)
f <- function(df, group_var, summary_var)
{
  # Capture the unquoted argument names as strings
  group_var <- deparse(substitute(group_var))
  summary_var <- deparse(substitute(summary_var))
  df %>%
    group_by(!!sym(group_var)) %>%
    # summary_var is a string here, so turn it back into a symbol before counting
    summarise(!!str_c("num_", summary_var) := n_distinct(!!sym(summary_var))) %>%
    mutate(!!str_c("per_", summary_var) := 100 * !!sym(str_c("num_", summary_var)) / sum(!!sym(str_c("num_", summary_var))))
}
f(iris, Species, Species)
#> # A tibble: 3 × 3
#> Species num_Species per_Species
#> <fct> <int> <dbl>
#> 1 setosa 1 33.3
#> 2 versicolor 1 33.3
#> 3 virginica 1 33.3
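To make the mechanism concrete, here is a minimal base R sketch of what deparse(substitute(...)) returns inside a function. The helper name arg_name is hypothetical and used only for illustration.

```r
# Hypothetical helper: returns the unevaluated argument as a string.
arg_name <- function(x) {
  deparse(substitute(x))
}

arg_name(Species)
#> [1] "Species"
arg_name(Sepal.Length)
#> [1] "Sepal.Length"
```

Note that deparse() can split a long expression into several strings, so this approach is probably best kept to bare column names.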
Upvotes: 2
Reputation: 919
Are you sure n_distinct is what you want? After grouping by Species, counting the distinct Species in each group always gives 1, so with three species (setosa, versicolor, virginica) each one comes out as 1/3 no matter how many rows it has. The iris dataset happens to be balanced, with 50 rows per species, so each species really is 1/3 of the data, but more generally this will not be the case.
A function with data masking will cover imbalanced datasets for you:
library(dplyr)
my_func <- function(df, var, percent){
  df %>%
    count({{var}}) %>%
    # name the new column after the `percent` argument instead of hard-coding it
    mutate("{{ percent }}" := 100 * n/sum(n))
}
my_func(iris, Species, percent)
iris %>%
my_func(Species, percent) #or with pipe
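To illustrate the difference on imbalanced data, here is a small sketch comparing the two approaches. The toy data frame and the names by_distinct and by_count are made up for this example.

```r
library(dplyr)

# Hypothetical imbalanced data: three rows in group "a", one in group "b".
toy <- tibble(grp = c("a", "a", "a", "b"))

# Distinct values of the grouping column: always 1 per group.
by_distinct <- toy %>%
  group_by(grp) %>%
  summarise(num = n_distinct(grp)) %>%
  mutate(perc = 100 * num / sum(num))

# Row counts per group: reflects the imbalance.
by_count <- toy %>%
  count(grp) %>%
  mutate(perc = 100 * n / sum(n))

by_distinct$perc
#> [1] 50 50
by_count$perc
#> [1] 75 25
```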
Upvotes: 1
Reputation: 5956
What you can do is use as_label(enquo()) on your group_var to extract the variable you passed as a character string and use it to build your new column names. You can see a clear example of this in section 6.1.3 of the document you linked. In this way, we can dynamically prepend num_ and perc_ to your summary variable, and just have to pass in df and group_var.
library(dplyr)
num_perc <- function(df, group_var) {
  summary_lbl <- as_label(enquo(group_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)
  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ group_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}
num_perc(iris, Species)
#> # A tibble: 3 × 3
#> Species num_Species perc_Species
#> <fct> <int> <dbl>
#> 1 setosa 1 33.3
#> 2 versicolor 1 33.3
#> 3 virginica 1 33.3
In the case where group_var and summary_var actually differ, it's essentially the same solution.
num_perc <- function(df, group_var, summary_var) {
  summary_lbl <- as_label(enquo(summary_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)
  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ summary_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}
num_perc(iris, Species, Species)
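For a case where the two arguments really do differ, here is a sketch of a variant that builds the column names with glue-style strings on the left of `:=`. This assumes a reasonably recent dplyr (1.0 or later) and rlang, and the name num_perc_glue is hypothetical. It is one possible way to write it, not the only one.

```r
library(dplyr)

# Hypothetical variant of num_perc(): "{{ }}" inside a string on the left
# of `:=` injects the argument's label into the new column name.
num_perc_glue <- function(df, group_var, summary_var) {
  # Label of the summary column, needed to refer back to it in mutate()
  num_lbl <- paste0("num_", as_label(enquo(summary_var)))
  df %>%
    group_by({{ group_var }}) %>%
    summarise("num_{{ summary_var }}" := n_distinct({{ summary_var }})) %>%
    mutate("perc_{{ summary_var }}" := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

# Distinct gear values within each cyl group of mtcars
res <- num_perc_glue(mtcars, cyl, gear)
names(res)
#> [1] "cyl"       "num_gear"  "perc_gear"
res$num_gear
#> [1] 3 3 2
```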
Upvotes: 6