Jeff Jarvis

Reputation: 95

Creating a summary table in R with comparisons between two groups

I frequently want to create summary tables for studies where I compare several variables between two groups, listing the value of each variable in each group along with the difference between the two groups.

For example, say I want to compare age group (young and old) and the proportion of males between two groups, A and B. I'd like to end up with a table with one row per variable (age, proportion of males) and the following columns: the numerator, denominator, and rate for each group, then the difference between the two rates, its 95% CI, and the p-value from a chi-squared test.

I’m looking for a general approach to this type of table.

Let’s say I have the following data frame:

library(dplyr)    
AgeGroup <- sample(c("Young", "Old"), 10, replace = TRUE)
Gender <- sample(c("Male", "Female"), 10, replace = TRUE)
df <- data.frame(AgeGroup, Gender)
df

I can create a summary table without the comparison easily:

df1 <- df %>%
  group_by(AgeGroup) %>%
  summarise(num_M = sum(Gender == "Male"),
            den_M = n(),
            prop_M = num_M/den_M)
df1

But I can’t figure out how to add columns that compare the rows of the grouped data with each other. Let’s say I want to run a chi-squared test (chisq.test) on the proportion of males in each AgeGroup and add the p-value to the summary table above.

It would look like this (numbers, obviously, are examples), Y = Young, O = Old:

[Image: mock-up of the desired summary table]

Any gentle nudges in the right direction would be greatly appreciated.

Thanks!

Upvotes: 0

Views: 2495

Answers (1)

Marius

Reputation: 60070

I like the finalfit package for summary tables. It might not be flexible enough if you need custom summary functions, but its default statistics cover everything you've asked for in your example: the counts in each group, the proportions, and a chi-squared test. If you have continuous variables, it will calculate the mean and SD in each group.

library(finalfit)

finalfit::summary_factorlist(
    df,
    dependent = "Gender", 
    explanatory = "AgeGroup",
    total_col = TRUE,
    p = TRUE
)

Output:

     label levels   Female      Male Total     p
1 AgeGroup    Old  0 (0.0) 6 (100.0)     6 0.197
2           Young 1 (25.0)  3 (75.0)     4    
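
If you would rather stay with dplyr and base R, here is a minimal sketch of one way to get the difference, its 95% CI, and the p-value into a table. It is not what finalfit does internally. The seed, the sample size of 100 (chosen so the chi-squared approximation is reasonable), and the names `counts`, `test`, and `comparison` are my own assumptions for illustration. For two groups, `prop.test()` gives the same p-value as `chisq.test()` on the 2x2 table (both apply the continuity correction by default) and also returns a confidence interval for the difference in proportions.

```r
library(dplyr)

# Assumption: a fixed seed and a larger sample than in the question,
# used here only to make the sketch reproducible.
set.seed(42)
df <- data.frame(
  AgeGroup = sample(c("Young", "Old"), 100, replace = TRUE),
  Gender   = sample(c("Male", "Female"), 100, replace = TRUE)
)

# Per-group numerator, denominator, and proportion, as in the question.
# group_by() sorts the groups, so "Old" is row 1 and "Young" is row 2.
counts <- df %>%
  group_by(AgeGroup) %>%
  summarise(num_M  = sum(Gender == "Male"),
            den_M  = n(),
            prop_M = num_M / den_M)

# Two-sample test of equal proportions (row 1 vs row 2).
test <- prop.test(x = counts$num_M, n = counts$den_M)

# One-row comparison table: difference, 95% CI, and p-value.
comparison <- data.frame(
  variable   = "Proportion male",
  contrast   = paste(counts$AgeGroup[1], "-", counts$AgeGroup[2]),
  difference = counts$prop_M[1] - counts$prop_M[2],
  ci_lower   = test$conf.int[1],
  ci_upper   = test$conf.int[2],
  p_value    = test$p.value
)

counts
comparison
```

To cover several variables, you could wrap the last two steps in a function and bind the resulting rows together with `dplyr::bind_rows()`.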

Upvotes: 4
