Reputation: 27732
sample data
df <- data.frame( id = 1:10,
group = c(1,1,1,1,1,2,2,2,2,2),
p1 = c("A", NA, "A", "A", "B", NA, NA, NA, NA, "C"),
p2 = c("F", NA, "G", "G", "A", "H", NA, NA, NA, NA),
stringsAsFactors = FALSE )
# id group p1 p2
# 1 1 1 A F
# 2 2 1 <NA> <NA>
# 3 3 1 A G
# 4 4 1 A G
# 5 5 1 B A
# 6 6 2 <NA> H
# 7 7 2 <NA> <NA>
# 8 8 2 <NA> <NA>
# 9 9 2 <NA> <NA>
# 10 10 2 C <NA>
I want to summarise df
by group, so that I get columns of totals from
NA
desired output
data.frame( group = c(1,2),
total = c(5,5),
with_any_p = c(4,2),
with_any_p_is_A = c(4,0),
stringsAsFactors = FALSE)
# group total with_any_p with_any_p_is_A
# 1 1 5 4 4
# 2 2 5 2 0
code so far
I know I can obtain the desired output using:
df %>% group_by( group ) %>%
summarise( total = n_distinct( id[] ),
with_any_p = n_distinct( id[ !is.na(p1) | ! is.na(p2) ] ),
with_any_p_is_A = n_distinct( id[ p1 == "A" | p2 == "A" ], na.rm = TRUE ) )
# # A tibble: 2 x 4
# group total with_any_p with_any_p_is_A
# <dbl> <int> <int> <int>
# 1 1 5 4 4
# 2 2 5 2 0
question
but since my production data containt a lot of 'p-columns', I do not want to retype the above or-statements for p1-p100
I can select the desired rows/subset with filter_at
:
p.cols <- paste0( "p", 1:2 )
#for with_any_p
df %>% filter_at( vars( p.cols ), any_vars( !is.na(.) ) )
#for with_any_p_is_A
df %>% filter_at( vars( p.cols ), any_vars( . == "A" ) )
but I do know now how to get these kind of selections into summarise.
Can this be done in the same 'style' as the code I already have, so that I get the desired result in one go, without having to bind/join multiple results?
Upvotes: 3
Views: 200
Reputation: 50668
Here is a solution for an arbitrary number of "p"
columns using an initial wide-to-long conversion
df %>%
gather(key, val, -id, -group) %>%
group_by(group) %>%
summarise(
total = n_distinct(id),
with_any_p = n_distinct(id[!is.na(val)]),
with_any_p_is_A = n_distinct(id[val == "A"], na.rm = T))
## A tibble: 2 x 4
# group total with_any_p with_any_p_is_A
# <dbl> <int> <int> <int>
#1 1 5 4 4
#2 2 5 2 0
A comment: I assume that all columns except id
and group
are "p"
columns. If this is not the case you may have to change the gather
statement to reflect your more general column structure.
Upvotes: 3