Wimpel
Wimpel

Reputation: 27732

summarise based on multiple columns with a lot of conditions

sample data

df <- data.frame( id = 1:10,
                  group = c(1,1,1,1,1,2,2,2,2,2),
                  p1 = c("A", NA, "A", "A", "B", NA, NA, NA, NA, "C"),
                  p2 = c("F", NA, "G", "G", "A", "H", NA, NA, NA, NA),
                  stringsAsFactors = FALSE )

#     id group   p1   p2
#  1   1     1    A    F
#  2   2     1 <NA> <NA>
#  3   3     1    A    G
#  4   4     1    A    G
#  5   5     1    B    A
#  6   6     2 <NA>    H
#  7   7     2 <NA> <NA>
#  8   8     2 <NA> <NA>
#  9   9     2 <NA> <NA>
# 10  10     2    C <NA>

I want to summarise df by group, so that I get columns of totals from

desired output

data.frame( group = c(1,2),
            total = c(5,5),
            with_any_p = c(4,2),
            with_any_p_is_A = c(4,0),
            stringsAsFactors = FALSE)

#   group total with_any_p with_any_p_is_A
# 1     1     5          4               4
# 2     2     5          2               0

code so far

I know I can obtain the desired output using:

df %>% group_by( group ) %>% 
  summarise( total = n_distinct( id[] ),
             with_any_p = n_distinct( id[ !is.na(p1) | ! is.na(p2) ] ), 
             with_any_p_is_A = n_distinct( id[ p1 == "A" | p2 == "A" ], na.rm = TRUE ) )

# # A tibble: 2 x 4
#   group total with_any_p with_any_p_is_A
#   <dbl> <int>      <int>           <int>
# 1     1     5          4               4
# 2     2     5          2               0

question

but since my production data containt a lot of 'p-columns', I do not want to retype the above or-statements for p1-p100

I can select the desired rows/subset with filter_at:

p.cols <- paste0( "p", 1:2 )
#for with_any_p
df %>% filter_at( vars( p.cols ), any_vars( !is.na(.) ) )
#for with_any_p_is_A
df %>% filter_at( vars( p.cols ), any_vars( . == "A" ) )

but I do know now how to get these kind of selections into summarise.

Can this be done in the same 'style' as the code I already have, so that I get the desired result in one go, without having to bind/join multiple results?

Upvotes: 3

Views: 200

Answers (1)

Maurits Evers
Maurits Evers

Reputation: 50668

Here is a solution for an arbitrary number of "p" columns using an initial wide-to-long conversion

df %>%
    gather(key, val, -id, -group) %>%
    group_by(group) %>%
    summarise(
        total = n_distinct(id),
        with_any_p = n_distinct(id[!is.na(val)]),
        with_any_p_is_A = n_distinct(id[val == "A"], na.rm = T))
## A tibble: 2 x 4
#  group total with_any_p with_any_p_is_A
#  <dbl> <int>      <int>           <int>
#1     1     5          4               4
#2     2     5          2               0

A comment: I assume that all columns except id and group are "p" columns. If this is not the case you may have to change the gather statement to reflect your more general column structure.

Upvotes: 3

Related Questions