Reputation: 56
I want to calculate the relative proportion by group for every column - except the grouping column - of a data frame. However, this should be programmed once to be used with different data frames which will have a different number of columns with different names. Because I am relying heavily on dplyr in this project, I want to achive this with dplyr.
I have read this topic, regarding a similiar but less complex problem:
Use dynamic variable names in `dplyr`
and also vignette("programming", "dplyr")
but I am still not able to set the quotation correctly. I am really stuck at this point and like to have some advice of more experienced developers.
To reproduce the problem, I have set up a minimal example with a data frame with randomly created data columns and a grouping column.
library(dplyr)
library(stringi)
df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)),
stri_rand_strings(3, 10, pattern = "[A-Za-z]"))
group <- c("group1","group2","group3")
df <- cbind(df, group)
The following function should achive two things:
propsum <- function(df, expr){
expr_quo <- enquo(expr)
sum <- paste(quo_name(expr), "sum", sep = ".")
prop <- paste(quo_name(expr), "prop", sep = ".")
df %>%
group_by(., group) %>%
mutate(., !! sum := sum(!! expr_quo),
!! prop := expr / !! sum * 100) -> df
return(df)
}
for(i in length(df)-1){
propsum(df, names(df)[i]) -> df_new
}
The expected result is a data frame with the initial columns, the sums by group for every initial column and the relative proportions for every initial column by group. So in the example, the data frame should have 10 columns (1 goruping column, 3 initial data columns, 3 columns with sums by group, 3 columns with relative proportions by group).
However, I am getting the following error:
Error in sum(~names(df)[i]) : invalid 'type' (character) of argument
In the vignette, the code example for a similar task ist:
my_mutate <- function(df, expr) {
expr <- enquo(expr)
mean_name <- paste0("mean_", quo_name(expr))
sum_name <- paste0("sum_", quo_name(expr))
mutate(df,
!! mean_name := mean(!! expr),
!! sum_name := sum(!! expr)
)
}
my_mutate(df, a)
#> # A tibble: 5 x 6
#> g1 g2 a b mean_a sum_a
#> <dbl> <dbl> <int> <int> <dbl> <int>
#> 1 1 1 5 4 3 15
#> 2 1 2 3 2 3 15
#> 3 2 1 4 1 3 15
#> 4 2 2 1 3 3 15
#> # … with 1 more row
I tried a lot of different things as of now, but I am not able to get the RHS to use the correct column. What am I doing wrong?
Upvotes: 1
Views: 360
Reputation: 56
I have found a solution which I just want to share in case somebody faces a similar task.
The solution is, to call rlang::parse_expr()
explicitly to save the varnames as expressions.
Here is the working example:
library(dplyr)
library(stringi)
df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)),
stri_rand_strings(3, 10, pattern = "[A-Za-z]"))
group <- c("group1","group2","group3")
df <- cbind(df, group)
gpercentage <- function(df, a_var, p_var, sum_var){
df %>%
group_by(., group) %>%
mutate(., !! sum_var := sum(!! a_var),
!! p_var := !! a_var / sum(!! a_var)) -> df
return(df)
}
i <- 1
for(i in seq_along(1:(length(df)-1))){
a_var <- rlang::parse_expr(names(df)[i])
p_var <- rlang::parse_expr(paste(names(df)[i], "P", sep = "."))
sum_var <- rlang::parse_expr(paste(names(df)[i], "SUM", sep = "."))
df %>%
gpercentage(., a_var, p_var, sum_var) -> df
}
Upvotes: 1
Reputation: 13319
We could achieve this as follows. :
propsum <- function(df, grouping_column){
df %>%
group_by(!!sym(grouping_column)) %>%
summarise_all(list(sum,function(x)
length(x)/nrow(.) * 100)) %>%
tidyr::pivot_longer(cols=-1,
names_to = "Variable",
values_to = "Value") %>%
mutate(Variable = gsub("fn1","sum",Variable),
Variable = gsub("fn2","prop",Variable))
}
propsum(iris,"Species")
Using df
in the question:
propsum(df,"group")
# A tibble: 18 x 3
group Variable Value
<fct> <chr> <dbl>
1 group1 dVFQteFGjs_sum 628
2 group1 wiQCPUeIvC_sum 599
3 group1 yBvktNXcfd_sum 644
4 group1 dVFQteFGjs_prop 33.3
5 group1 wiQCPUeIvC_prop 33.3
6 group1 yBvktNXcfd_prop 33.3
7 group2 dVFQteFGjs_sum 630
8 group2 wiQCPUeIvC_sum 606
9 group2 yBvktNXcfd_sum 656
10 group2 dVFQteFGjs_prop 33.3
11 group2 wiQCPUeIvC_prop 33.3
12 group2 yBvktNXcfd_prop 33.3
13 group3 dVFQteFGjs_sum 636
14 group3 wiQCPUeIvC_sum 581
15 group3 yBvktNXcfd_sum 635
16 group3 dVFQteFGjs_prop 33.3
17 group3 wiQCPUeIvC_prop 33.3
18 group3 yBvktNXcfd_prop 33.3
To get back to wide(can use pivot_wider
, I find spread
"faster" to use),
propsum(df,"group") %>%
tidyr::spread(Variable,Value)
# A tibble: 3 x 7
group dVFQteFGjs_prop dVFQteFGjs_sum wiQCPUeIvC_prop wiQCPUeIvC_sum
<fct> <dbl> <dbl> <dbl> <dbl>
1 grou~ 33.3 628 33.3 599
2 grou~ 33.3 630 33.3 606
3 grou~ 33.3 636 33.3 581
# ... with 2 more variables: yBvktNXcfd_prop <dbl>,
# yBvktNXcfd_sum <dbl>
Upvotes: 0