Reputation: 105
I am trying to summarise sets of overlapping categories from a "here are 10 categories, pick 3" survey question. Each category has a value if they picked it and an NA if they didn't. Here is a dummy example with 3 categories:
P_A<-c("A","A",NA,NA)
P_B<-c(NA,"B","B",NA)
P_C<-c("C", NA, NA,NA)
grouper<-c("M", "M", "F","F")
data.df<-data.frame(P_A,P_B,P_C,grouper)
I am trying to create a table that shows total number select and percentage (count over number of people that selected at least one option), then also be able to subdivide by a covariate. Eg. the following table or two versions of the following table for M and F (but could be more than 2 categories in the grouper). Note, in my example, the 4th row is someone who selected no categories so is not part of the denominator for the percentage.
Variable | Overall count | Overall % |
---|---|---|
P_A | 2 | 66.67 |
P_B | 2 | 66.67 |
P_C | 1 | 33.33 |
I can produce the figures using colSums(!isna(select(data.df, P_A:P_C)))
and then dividing by sum(rowSums(!is.na(select(data.df, P_A:P_C)))>0)
but I'm struggling to make that more generalisable for other variables and am not sure how to neatly do it when I want separate tables by the grouper variable.
Upvotes: 0
Views: 214
Reputation: 388982
You can remove the rows with all NA
values, get the data in long format and for each column find number and percentage of non-NA values.
library(dplyr)
data.df %>%
select(-grouper) %>%
filter(Reduce(`|`, across(.fns = ~!is.na(.)))) %>%
tidyr::pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(count = sum(!is.na(value)),
overall_perc = count/n() * 100)
# name count overall_perc
# <chr> <int> <dbl>
#1 P_A 2 66.7
#2 P_B 2 66.7
#3 P_C 1 33.3
To do this for each category of grouper
you could do :
data.df %>%
filter(Reduce(`|`, across(-grouper, ~!is.na(.)))) %>%
tidyr::pivot_longer(cols = -grouper) %>%
group_by(name, grouper) %>%
summarise(count = sum(!is.na(value)),
overall_perc = count/n() * 100)
Upvotes: 3