Lyndon Walker
Lyndon Walker

Reputation: 105

Counts and percentages for overlapping categories in R

I am trying to summarise sets of overlapping categories from a "here are 10 categories, pick 3" survey question. Each category has a value if they picked it and an NA if they didn't. Here is a dummy example with 3 categories:

P_A<-c("A","A",NA,NA)
P_B<-c(NA,"B","B",NA)
P_C<-c("C", NA, NA,NA)
grouper<-c("M", "M", "F","F")
data.df<-data.frame(P_A,P_B,P_C,grouper)

I am trying to create a table that shows total number select and percentage (count over number of people that selected at least one option), then also be able to subdivide by a covariate. Eg. the following table or two versions of the following table for M and F (but could be more than 2 categories in the grouper). Note, in my example, the 4th row is someone who selected no categories so is not part of the denominator for the percentage.

Variable Overall count Overall %
P_A 2 66.67
P_B 2 66.67
P_C 1 33.33

I can produce the figures using colSums(!isna(select(data.df, P_A:P_C))) and then dividing by sum(rowSums(!is.na(select(data.df, P_A:P_C)))>0) but I'm struggling to make that more generalisable for other variables and am not sure how to neatly do it when I want separate tables by the grouper variable.

Upvotes: 0

Views: 214

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388982

You can remove the rows with all NA values, get the data in long format and for each column find number and percentage of non-NA values.

library(dplyr)

data.df %>%
  select(-grouper) %>%
  filter(Reduce(`|`, across(.fns =  ~!is.na(.)))) %>%
  tidyr::pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(count = sum(!is.na(value)), 
            overall_perc = count/n() * 100)

#  name  count overall_perc
#  <chr> <int>        <dbl>
#1 P_A       2         66.7
#2 P_B       2         66.7
#3 P_C       1         33.3

To do this for each category of grouper you could do :

data.df %>%
  filter(Reduce(`|`, across(-grouper, ~!is.na(.)))) %>%
  tidyr::pivot_longer(cols = -grouper) %>%
  group_by(name, grouper) %>%
  summarise(count = sum(!is.na(value)), 
            overall_perc = count/n() * 100)

Upvotes: 3

Related Questions