jakes

Reputation: 2095

How to apply a function that takes data frame as input with purrr

With the data like this:

library(tidyverse)

df <- tibble(x = runif(200),
             y = runif(200, 0, 3),
             is_active = sample(c(0, 1), size = 200, replace = TRUE, prob = c(0.2, 0.8)),
             var1 = sample(c(0, 1), 200, TRUE),
             var2 = sample(c(0, 1), 200, TRUE))

# A tibble: 6 x 5
       x     y is_active  var1  var2
   <dbl> <dbl>     <dbl> <dbl> <dbl>
1 0.0812 2.42          0     0     0
2 0.313  1.61          0     1     1
3 0.241  2.90          1     0     0
4 0.906  1.08          1     0     1
5 0.652  2.86          0     0     0
6 0.231  0.730         1     1     0
...

I want to calculate the proportion of is_active only for the observations where var1 == 1, then for those where var2 == 1, and so on. I have written a function that works for a single variable:

f <- function(df, var){
  var <- ensym(var)

  df %>%
    filter(!!var == 1) %>%
    mutate(xcut = cut(x, breaks = 10),
           ycut = cut(y, breaks = 20)) %>%
    group_by(xcut, ycut) %>%
    summarise(!!paste(var, 'proportion', sep = '_') := mean(is_active)) %>%
    ungroup()

}

And calling it as below works fine:

f(df, var1)
f(df, var2)

The issue is that I have hundreds of columns like var1 and var2, and I'd like to iterate over all of them, calculating the proportion of is_active defined above for each one. map_at(df, vars(var1, var2), f) doesn't work here because it applies f to each selected column (a vector) in turn and doesn't pass the whole data frame to each call. How can I achieve this, preferably with the purrr package?

Upvotes: 2

Views: 113

Answers (2)

Andreé

Reputation: 69

I would do something like this, reshaping the var columns to long format and counting:

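library(tidyverse)

calc_pct_isactive <- function(df, regex_col = "^var") {
    df %>%
        # One row per (observation, variable) pair
        pivot_longer(cols = matches(regex_col)) %>%
        group_by(is_active, name, value) %>%
        tally(name = "count") %>%
        # The base is all observations with the same variable and value,
        # so pct is the share of active rows among those where the variable is 1
        group_by(name, value) %>%
        mutate(base = sum(count, na.rm = TRUE),
               pct = count / base) %>%
        ungroup() %>%
        filter(is_active == 1, value == 1)
}
calc_pct_isactive(df)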
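
The function above does not apply the x/y binning from the question. As a rough sketch, the same long-format idea could be extended to the bins as follows. The names long_props and proportion are made up for illustration, and the "^var" pattern is assumed to match all indicator columns. Note that the cut() breaks here are computed once on the full data, whereas the question's f() computes them after filtering, so the bin edges will not match f() exactly.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- tibble(
  x = runif(200),
  y = runif(200, 0, 3),
  is_active = sample(c(0, 1), size = 200, replace = TRUE, prob = c(0.2, 0.8)),
  var1 = sample(c(0, 1), 200, TRUE),
  var2 = sample(c(0, 1), 200, TRUE)
)

long_props <- df %>%
  # Bin x and y once, on the full data (differs from f(), see note above)
  mutate(xcut = cut(x, breaks = 10),
         ycut = cut(y, breaks = 20)) %>%
  # One row per (observation, indicator column) pair
  pivot_longer(matches("^var"), names_to = "name", values_to = "value") %>%
  # Keep only rows where the indicator is switched on
  filter(value == 1) %>%
  group_by(name, xcut, ycut) %>%
  summarise(proportion = mean(is_active), .groups = "drop")

long_props
```

The result is a single long tibble with one row per variable and bin, which can be filtered or pivoted wider as needed.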

Upvotes: 0

Ronak Shah

Reputation: 389235

You could pass the column name to your function as a string and modify the function slightly, using sym() instead of ensym():

library(tidyverse)

f <- function(df, var){

  df %>%
    filter(!!sym(var) == 1) %>%
    mutate(xcut = cut(x, breaks = 10),
           ycut = cut(y, breaks = 20)) %>%
    group_by(xcut, ycut) %>%
    summarise(!!paste(var, 'proportion', sep = '_') := mean(is_active)) %>%
    ungroup()
}

You can then map over the column names, passing the data frame as a fixed argument:

map(c('var1', 'var2'), f, df = df)

#[[1]]
# A tibble: 2 x 3
#  xcut          ycut          var1_proportion
#  <fct>         <fct>                   <dbl>
#1 (0.231,0.239] (0.729,0.774]               1
#2 (0.305,0.313] (1.57,1.61]                 0

#[[2]]
# A tibble: 2 x 3
#  xcut          ycut        var2_proportion
#  <fct>         <fct>                 <dbl>
#1 (0.312,0.372] (1.58,1.61]               0
#2 (0.847,0.907] (1.08,1.11]               1
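
With hundreds of columns, typing the names by hand is impractical. A minimal sketch of one way to build the name vector from the data and keep track of which result belongs to which column is shown below. It assumes all indicator columns share the "var" prefix; var_cols and results are illustrative names, and .groups = "drop" is used in place of ungroup() (this needs dplyr 1.0 or later).

```r
library(tidyverse)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- tibble(
  x = runif(200),
  y = runif(200, 0, 3),
  is_active = sample(c(0, 1), size = 200, replace = TRUE, prob = c(0.2, 0.8)),
  var1 = sample(c(0, 1), 200, TRUE),
  var2 = sample(c(0, 1), 200, TRUE)
)

# Same logic as f() above, taking the column name as a string
f <- function(df, var) {
  df %>%
    filter(!!sym(var) == 1) %>%
    mutate(xcut = cut(x, breaks = 10),
           ycut = cut(y, breaks = 20)) %>%
    group_by(xcut, ycut) %>%
    summarise(!!paste(var, "proportion", sep = "_") := mean(is_active),
              .groups = "drop")
}

# Collect every column whose name starts with "var" (assumed naming pattern)
var_cols <- str_subset(names(df), "^var")

# set_names() names the vector by its own values, so map() returns a
# named list: results$var1, results$var2, ...
results <- var_cols %>%
  set_names() %>%
  map(f, df = df)

results$var1
```

Because each call filters before binning, the bin edges differ between list elements, so the tibbles are probably best kept as a list rather than joined on xcut and ycut.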

Upvotes: 2
