Summarising levels of multiple factor variables

Question

I have searched for similar questions, but cannot find the exact solution required. This question is somewhat similar, but only deals with the issue of summarising multiple continuous variables, not factors.

I have a dataframe consisting of 4 factor variables (sex, agegroup, hiv, group), e.g.

set.seed(20150710)
df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=T))), 
           agegroup=as.factor(c(sample(1:5,10000, replace=T))),
           hiv=as.factor(c(sample(1:3,10000, replace=T))),
           group=as.factor(c(sample(1:2,10000, replace=T)))
           )

levels(df$sex)<- c("Male", "Female")
levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+")
levels(df$hiv)<-c("Positive", "Negative", "Not tested")
levels(df$group)<-c("Intervention", "Control")

I would like to create a summary table giving counts and proportions for each level of the exposure variables sex, agegroup and hiv, stratified by group.

EDIT: this is what I am aiming for:

                X N_Control Percent_Control N_Intervention     Percent_Intervention
1      sex_Female      2517       0.5041057           2480            0.4953066
2        sex_Male      2476       0.4958943           2527            0.5046934
3  agegroup_16-24      1005       0.2012818            992            0.1981226
4  agegroup_25-34      1001       0.2004807            996            0.1989215
5  agegroup_35-44      1010       0.2022832            997            0.1991212
6  agegroup_45-54       976       0.1954737            996            0.1989215
7    agegroup_55+      1001       0.2004807           1026            0.2049131
8    hiv_Negative      1679       0.3362708           1642            0.3279409
9  hiv_Not tested      1633       0.3270579           1660            0.3315359
10   hiv_Positive      1681       0.3366713           1705            0.3405233

But I cannot get it to work with summarise_each in dplyr; only overall variable counts and proportions, and not for each factor level, are given:

df.out<-df %>%
  group_by(group) %>%
  summarise_each(funs(N=n(), Percent=n()/sum(n())), sex, agegroup, hiv)
print(df.out)

group sex_N agegroup_N hiv_N sex_Percent agegroup_Percent hiv_Percent
1     1  4973       4973  4973           1                1           1
2     2  5027       5027  5027           1                1           1

Finally, is there some way to reshape the table (e.g. using tidyr), so that the exposure variables (sex, agegroup, hiv) are reported as rows?

Thanks

Jaap · Accepted Answer

Doing it in two steps will give you the desired result. First, calculate the n, then calculate the percentage by group:

library(dplyr)
df.out <- df %>%
  group_by(group, sex, agegroup, hiv) %>%
  tally() %>%
  group_by(group) %>%
  mutate(percent=n/sum(n))

A solution with data.table:

library(data.table)
dt.out <- setDT(df)[, .N, by=.(group, sex, agegroup, hiv)][, percent:=N/sum(N), by=group]

library(microbenchmark)
microbenchmark(df.out = df %>%
                 group_by(group, sex, agegroup, hiv) %>%
                 tally() %>%
                 group_by(group) %>%
                 mutate(percent=n/sum(n)),
               dt.out = df[,.N,by=.(group, sex, agegroup, hiv)][,percent:=N/sum(N),by=group])

# Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval cld
# df.out 8.299870 8.518590 8.894504 8.708315 8.931459 11.964930   100   b
# dt.out 2.346632 2.394788 2.540132 2.441777 2.551235  4.344442   100  a

Conclusion: the data.table solution is much faster (3.5x).

To get a table like you requested after the edit of your question, you can do the following:

library(data.table)

setDT(df)
dt.sex <- dcast(df[,.N, by=.(sex,group)][,percent:=N/sum(N)], sex ~ group, value.var = c("N", "percent"))
dt.age <- dcast(df[,.N, by=.(agegroup,group)][,percent:=N/sum(N)], agegroup ~ group, value.var = c("N", "percent"))
dt.hiv <- dcast(df[,.N, by=.(hiv,group)][,percent:=N/sum(N)], hiv ~ group, value.var = c("N", "percent"))

dt.out.wide <- rbindlist(list(dt.sex, dt.age, dt.hiv), use.names=FALSE)
names(dt.out.wide) <- c("X","N_Intervention","N_Control","percent_Intervention","percent_Control")

this gives:

> dt.out.wide
             X N_Intervention N_Control percent_Intervention percent_Control
 1:       Male           2454      2488               0.2454          0.2488
 2:     Female           2561      2497               0.2561          0.2497
 3:      16-24            954       991               0.0954          0.0991
 4:      25-34           1033      1002               0.1033          0.1002
 5:      35-44           1051      1000               0.1051          0.1000
 6:      45-54            983       978               0.0983          0.0978
 7:        55+            994      1014               0.0994          0.1014
 8:   Positive           1717      1664               0.1717          0.1664
 9:   Negative           1637      1659               0.1637          0.1659
10: Not tested           1661      1662               0.1661          0.1662

Summarising levels of multiple factor variables

Answers (1)

Related Questions