CoCoL0r3s
CoCoL0r3s

Reputation: 56

Compute sum and relative proportion by group for any number of columns with random names using dplyr

I want to calculate the relative proportion by group for every column - except the grouping column - of a data frame. However, this should be programmed once to be used with different data frames which will have a different number of columns with different names. Because I am relying heavily on dplyr in this project, I want to achive this with dplyr.

I have read this topic, regarding a similiar but less complex problem: Use dynamic variable names in `dplyr` and also vignette("programming", "dplyr") but I am still not able to set the quotation correctly. I am really stuck at this point and like to have some advice of more experienced developers.

To reproduce the problem, I have set up a minimal example with a data frame with randomly created data columns and a grouping column.

library(dplyr)
library(stringi)

df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)), 
               stri_rand_strings(3, 10, pattern = "[A-Za-z]"))

group <- c("group1","group2","group3")

df <- cbind(df, group)

The following function should achive two things:

  1. calculate the sum of every column in the data frame by group
  2. calculate the relative proportions of every column in the data frame by group
propsum <- function(df, expr){

  expr_quo <- enquo(expr)

  sum <- paste(quo_name(expr), "sum", sep = ".")
  prop <- paste(quo_name(expr), "prop", sep = ".")

  df %>%
    group_by(., group) %>%
    mutate(., !! sum :=  sum(!! expr_quo),
              !! prop := expr / !! sum * 100) -> df

  return(df)
}

for(i in length(df)-1){
  propsum(df, names(df)[i]) -> df_new
}

The expected result is a data frame with the initial columns, the sums by group for every initial column and the relative proportions for every initial column by group. So in the example, the data frame should have 10 columns (1 goruping column, 3 initial data columns, 3 columns with sums by group, 3 columns with relative proportions by group).

However, I am getting the following error:

Error in sum(~names(df)[i]) : invalid 'type' (character) of argument

In the vignette, the code example for a similar task ist:

my_mutate <- function(df, expr) {
  expr <- enquo(expr)
  mean_name <- paste0("mean_", quo_name(expr))
  sum_name <- paste0("sum_", quo_name(expr))

  mutate(df,
    !! mean_name := mean(!! expr),
    !! sum_name := sum(!! expr)
  )
}

my_mutate(df, a)
#> # A tibble: 5 x 6
#>      g1    g2     a     b mean_a sum_a
#>   <dbl> <dbl> <int> <int>  <dbl> <int>
#> 1     1     1     5     4      3    15
#> 2     1     2     3     2      3    15
#> 3     2     1     4     1      3    15
#> 4     2     2     1     3      3    15
#> # … with 1 more row

I tried a lot of different things as of now, but I am not able to get the RHS to use the correct column. What am I doing wrong?

Upvotes: 1

Views: 360

Answers (2)

CoCoL0r3s
CoCoL0r3s

Reputation: 56

I have found a solution which I just want to share in case somebody faces a similar task. The solution is, to call rlang::parse_expr() explicitly to save the varnames as expressions.

Here is the working example:

library(dplyr)
library(stringi)

df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)), 
               stri_rand_strings(3, 10, pattern = "[A-Za-z]"))

group <- c("group1","group2","group3")

df <- cbind(df, group)

gpercentage <- function(df, a_var, p_var, sum_var){

  df %>%
    group_by(., group) %>%
    mutate(., !! sum_var := sum(!! a_var),
              !! p_var := !! a_var / sum(!! a_var)) -> df

  return(df)
}

i <- 1

for(i in seq_along(1:(length(df)-1))){

  a_var <- rlang::parse_expr(names(df)[i])
  p_var <- rlang::parse_expr(paste(names(df)[i], "P", sep = "."))
  sum_var <- rlang::parse_expr(paste(names(df)[i], "SUM", sep = "."))

df %>%
  gpercentage(., a_var, p_var, sum_var) -> df
}

Upvotes: 1

NelsonGon
NelsonGon

Reputation: 13319

We could achieve this as follows. :

propsum <- function(df, grouping_column){

 df %>%
    group_by(!!sym(grouping_column)) %>%
    summarise_all(list(sum,function(x) 
      length(x)/nrow(.) * 100)) %>% 
    tidyr::pivot_longer(cols=-1,
                        names_to = "Variable",
                        values_to = "Value") %>% 
    mutate(Variable = gsub("fn1","sum",Variable),
           Variable = gsub("fn2","prop",Variable))

}
propsum(iris,"Species") 

Using df in the question:

propsum(df,"group")  
# A tibble: 18 x 3
   group  Variable        Value
   <fct>  <chr>           <dbl>
 1 group1 dVFQteFGjs_sum  628  
 2 group1 wiQCPUeIvC_sum  599  
 3 group1 yBvktNXcfd_sum  644  
 4 group1 dVFQteFGjs_prop  33.3
 5 group1 wiQCPUeIvC_prop  33.3
 6 group1 yBvktNXcfd_prop  33.3
 7 group2 dVFQteFGjs_sum  630  
 8 group2 wiQCPUeIvC_sum  606  
 9 group2 yBvktNXcfd_sum  656  
10 group2 dVFQteFGjs_prop  33.3
11 group2 wiQCPUeIvC_prop  33.3
12 group2 yBvktNXcfd_prop  33.3
13 group3 dVFQteFGjs_sum  636  
14 group3 wiQCPUeIvC_sum  581  
15 group3 yBvktNXcfd_sum  635  
16 group3 dVFQteFGjs_prop  33.3
17 group3 wiQCPUeIvC_prop  33.3
18 group3 yBvktNXcfd_prop  33.3

To get back to wide(can use pivot_wider, I find spread "faster" to use),

    propsum(df,"group") %>% 
  tidyr::spread(Variable,Value)
# A tibble: 3 x 7
  group dVFQteFGjs_prop dVFQteFGjs_sum wiQCPUeIvC_prop wiQCPUeIvC_sum
  <fct>           <dbl>          <dbl>           <dbl>          <dbl>
1 grou~            33.3            628            33.3            599
2 grou~            33.3            630            33.3            606
3 grou~            33.3            636            33.3            581
# ... with 2 more variables: yBvktNXcfd_prop <dbl>,
#   yBvktNXcfd_sum <dbl>

Upvotes: 0

Related Questions