dplyr summarise for multiple input values for a user defined function

Question

I have a data frame df and want to find the proportion of unique values in col1 that satisfy a condition on col2.

set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE), 
                 col2 = sample(-75:75, 100, TRUE), 
                 col3 = sample(-75:75, 100, TRUE))

df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA

For example, I want the proportion of unique values in col1 that have at least one value in col2 within the range -10 to 10 inclusive.

library(dplyr)

df %>%  
  mutate(unqCol1 = n_distinct(col1)) %>% 
  group_by(col1) %>% 
  mutate(freq = sum(between(col2, -10, 10), na.rm = TRUE)) %>% 
  filter(freq > 0) %>% distinct(col1, unqCol1) %>% 
  ungroup() %>%  
  summarise(nrow(.)/unqCol1) %>% 
  unique()

which results in:

# A tibble: 1 x 1
  `nrow(.)/unqCol1`
              <dbl>
1             0.423

The snippet above is not an efficient way of doing this, but I wanted to get the result in a single piped command, and it gives the right output (any cleverer ways of rewriting it would be much appreciated). I have confirmed the output with a base R approach:

length(unique(df$col1[df$col2 >= -10 & df$col2 <= 10 & !is.na(df$col2)]))/length(unique(df$col1))

I would like to rewrite the dplyr code above as a function so that it can be run for multiple values of n (here n = 10) defining the range, and for multiple columns too. Is this possible? Or should I pass the multiple values within the code itself (without a function), along the lines of the apply family?

Martin C. Arnold · Accepted Answer

As you've noticed, your (dplyr) code is overly complicated. You can compute the proportion of interest without grouping the data:

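df %>% 
  tidyr::drop_na() %>%                  # drops rows with an NA in any column
  filter(between(col2, -10, 10)) %>%    # keep rows with col2 in [-10, 10]
  # numerator: distinct col1 among kept rows; denominator: distinct col1 in the full df
  summarize(prop = n_distinct(col1) / n_distinct(df$col1))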

A function for computing the proportion is:

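# `...` takes one or more filter conditions, e.g. between(col2, -10, 10)
my_summary <- function(df, ...) {
   df %>% 
     tidyr::drop_na() %>%    # drops rows with an NA in any column
     filter(...) %>%         # conditions are forwarded to filter()
     summarize(
       # df here is the unfiltered function argument
       prop = n_distinct(col1) / n_distinct(df$col1)
     )
}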

For example,

> my_summary(df, between(col2, -10, 10))
       prop
1 0.4230769

gives the proportion in your question.

EDIT

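To cover several columns and several values of n, you can redefine my_summary() to take a column name and n, vectorize it with Vectorize(), and use outer() to get a matrix of proportions for every combination of col and n: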

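# col: column name as a string; n: half-width of the range [-n, n]
# Note: df is taken from the calling (global) environment here
my_summary <- function(col, n) {
  df %>% 
    tidyr::drop_na() %>%                         # drops rows with an NA in any column
    filter(between(!!as.name(col), -n, n)) %>%   # turn the string into a column reference
    summarize(
      prop = n_distinct(col1) / n_distinct(df$col1)
    )
}
# Vectorize() lets outer() call my_summary() once per (col, n) pair
my_summary_v <- Vectorize(my_summary)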

> outer(c("col2", "col3"), c(10, 20, 30), my_summary_v)
     [,1]      [,2]      [,3]     
[1,] 0.4230769 0.5384615 0.6538462
[2,] 0.4230769 0.6538462 0.6923077
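
As a hedged alternative to Vectorize() and outer(), the sketch below loops with sapply() and returns a plain numeric matrix with labelled rows and columns. The helper name prop_in_range() and the "n=10"-style labels are illustrative assumptions, not part of the answer above or of dplyr. It also leaves out drop_na(): filter() already discards rows where the tested column is NA, so an NA in another column does not remove a row, and the values may therefore differ slightly from the matrix above.

```r
library(dplyr)

# Same example data as in the question
set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE),
                 col2 = sample(-75:75, 100, TRUE),
                 col3 = sample(-75:75, 100, TRUE))
df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA

# prop_in_range() is a hypothetical helper name used only for illustration.
# filter() drops rows where the condition is NA, so only NAs in `col` matter.
prop_in_range <- function(data, col, n) {
  hits <- data %>% filter(between(.data[[col]], -n, n))
  n_distinct(hits$col1) / n_distinct(data$col1)
}

cols <- c("col2", "col3")
ns   <- c(10, 20, 30)

# Inner sapply() runs over columns, outer sapply() over n:
# the result has one row per column and one column per n.
props <- sapply(ns, function(n) {
  sapply(cols, function(col) prop_in_range(df, col, n))
})
dimnames(props) <- list(cols, paste0("n=", ns))
props
```

Passing the data frame as an argument, rather than reading df from the global environment, makes the helper reusable on other data frames with a col1 column.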
