summarise based on multiple columns with a lot of conditions

Question

sample data

df <- data.frame( id = 1:10,
                  group = c(1,1,1,1,1,2,2,2,2,2),
                  p1 = c("A", NA, "A", "A", "B", NA, NA, NA, NA, "C"),
                  p2 = c("F", NA, "G", "G", "A", "H", NA, NA, NA, NA),
                  stringsAsFactors = FALSE )

#     id group   p1   p2
#  1   1     1    A    F
#  2   2     1  
#  3   3     1    A    G
#  4   4     1    A    G
#  5   5     1    B    A
#  6   6     2     H
#  7   7     2  
#  8   8     2  
#  9   9     2  
# 10  10     2    C

I want to summarise df by group, so that I get columns of totals from

unique id's
unique id's where any of the p-columns value is not NA
unique id's where any of the p-columns value is equal to "A"

desired output

data.frame( group = c(1,2),
            total = c(5,5),
            with_any_p = c(4,2),
            with_any_p_is_A = c(4,0),
            stringsAsFactors = FALSE)

#   group total with_any_p with_any_p_is_A
# 1     1     5          4               4
# 2     2     5          2               0

code so far

I know I can obtain the desired output using:

df %>% group_by( group ) %>% 
  summarise( total = n_distinct( id[] ),
             with_any_p = n_distinct( id[ !is.na(p1) | ! is.na(p2) ] ), 
             with_any_p_is_A = n_distinct( id[ p1 == "A" | p2 == "A" ], na.rm = TRUE ) )

# # A tibble: 2 x 4
#   group total with_any_p with_any_p_is_A
#                     
# 1     1     5          4               4
# 2     2     5          2               0

question

but since my production data containt a lot of 'p-columns', I do not want to retype the above or-statements for p1-p100

I can select the desired rows/subset with filter_at:

p.cols <- paste0( "p", 1:2 )
#for with_any_p
df %>% filter_at( vars( p.cols ), any_vars( !is.na(.) ) )
#for with_any_p_is_A
df %>% filter_at( vars( p.cols ), any_vars( . == "A" ) )

but I do know now how to get these kind of selections into summarise.

Can this be done in the same 'style' as the code I already have, so that I get the desired result in one go, without having to bind/join multiple results?

Maurits Evers · Accepted Answer

Here is a solution for an arbitrary number of "p" columns using an initial wide-to-long conversion

df %>%
    gather(key, val, -id, -group) %>%
    group_by(group) %>%
    summarise(
        total = n_distinct(id),
        with_any_p = n_distinct(id[!is.na(val)]),
        with_any_p_is_A = n_distinct(id[val == "A"], na.rm = T))
## A tibble: 2 x 4
#  group total with_any_p with_any_p_is_A
#                    
#1     1     5          4               4
#2     2     5          2               0

A comment: I assume that all columns except id and group are "p" columns. If this is not the case you may have to change the gather statement to reflect your more general column structure.

summarise based on multiple columns with a lot of conditions

Answers (1)

Related Questions