micturalgia
micturalgia

Reputation: 325

Use R to create a large multiple column frequency table

I am having a hard time doing this efficiently and apologize if it's a basic question. I need to make a contingency table with N and percent to summarize relationships between a large number of binary variables, simply in terms of freq and percent, with no other summary statistics.

Specifically, it is to summarize the number of patients that have sample type X and clinical outcome Y. A patient can have any number of outcomes and any number of samples, i.e. each variable is non-mutually exclusive and independent.

I would like to put all the outcomes (Death, ICU admission, leg fell off...) as columns, and all the sample types (serum, urine, etc...) as rows. I would only need to list the frequency and percentage of "positive" responses, i.e. N and % of patients who died and who had a urine sample.

Are there any packages out there that could help with this kind of table? Everything I am finding is good for doing a nice 1xN variable contingency table. I wouldn't mind making a separate table for each outcome, if I can somehow extract a column of that output and bind them together into a master table to rule them all. Another idea is to somehow make a frequency table of two mChoice (Hmisc package) variables. I do not know if either of those two strategies are possible.

Any ideas?

What I'm looking for is something like this:

+-------------+--------+---------+
|             | Death  | ICU     |
|             | (N=10) | (N=50)  |
+-------------+--------+---------+
|Serum (N=50) |5 (50%) | 30 (60%)|
+-------------+--------+---------+
|Urine (N=40) |10(100%)| 7 (14%) |
+-------------+--------+---------+
|Brain (N=25) |6 (60%) | 15 (30%)|
+-------------+--------+---------+
|Kidney (N=50)|7 (70%) | 40 (80%)|
+-------------+--------+---------+

Edit to include sample data:

set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

Upvotes: 0

Views: 4430

Answers (2)

rafa.pereira
rafa.pereira

Reputation: 13827

Here is a simple and fast solution using the data.table package.

library(data.table)

# convert your data frame to data.table
  setDT(df)


# create the output for serum
  serum <- df[serum==1, .(test="serum",
                          test.N = .N, 
                          death.count = sum(death),
                          death.N = sum(df$death),
                          death.prop=(sum(death)/sum(df$death))*100,
                          icu.count = sum(ICU),
                          icu.N = sum(df$ICU),
                          icu.prop=(sum(ICU)/sum(df$ICU))*100),
                          by=.(serum)]

# create the output for kidney
  kidney<- df[kidney==1, .(test="kidney",
                          test.N = .N, 
                          death.count = sum(death),
                          death.N = sum(df$death),
                          death.prop=(sum(death)/sum(df$death))*100,
                          icu.count = sum(ICU),
                          icu.N = sum(df$ICU),
                          icu.prop=(sum(ICU)/sum(df$ICU))*100),
                          by=.(kidney)]

# Bind outputs into a table
  table <- rbind( serum[,2:9,with = FALSE],
                  kidney[,2:9,with = FALSE])

table
>      test test.N death.count death.N death.prop icu.count icu.N icu.prop
> 1:  serum    806         602     752   80.05319       511   632 80.85443
> 2: kidney    190         141     752   18.75000       128   632 20.25316

Upvotes: 2

Ista
Ista

Reputation: 10437

EDIT: This is a revised answer offered after discussing the problem with the original poster. An older answer that does not solve the problem at hand is retained below for posterity.

This answer is not short nor concise, and I do hope there is a cleaner way. But the following will work:

## generate example data
set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

## load up our data manipulation workhorses
library(reshape2)
library(plyr)

## save typing by saving row and column var names
row.vars <- c("serum", "urine", "brain", "kidney")
col.vars <- c("death", "ICU")

## melt data so we have death/icu in a column
dat.m <- melt(df, measure.vars = row.vars)

## get rid of rows with death==0 and ICU==0
dat.m <- dat.m[dat.m$value == 1, ]

## for each of death and icu calculate proportion of 1's
tab <- ddply(dat.m, "variable", function(DF) {
  colwise(function(x) length(x[x==1]))(DF[col.vars])
})

## calculate overall proportions for row and column vars
row.nums <- sapply(df[row.vars], function(x) length(x[x==1]))
col.nums <- sapply(df[col.vars], function(x) length(x[x==1]))

## paste row and column counts into row and column names
rownames(tab) <- paste(tab$variable, " (N=", row.nums, ")", sep="")
tab$variable <- NULL
colnames(tab) <- paste(names(tab), " (N=", col.nums, ")", sep="")

## calculate cell proportions and paste them in one column at a time
tab[[1]] <- paste(tab[[1]],
                  " (",
                  round(100*(tab[[1]]/col.nums[[1]]), digits=2),
                  "%)",
                  sep="")
tab[[2]] <- paste(tab[[2]],
                  " (",
                  round(100*(tab[[2]]/col.nums[[2]]),
                        digits=2),
                  "%)",
                  sep="")

Now we can

## behold the fruits of our labor
tab
               death (N=752)  ICU (N=632)
serum (N=806)   602 (80.05%) 511 (80.85%)
urine (N=739)   556 (73.94%)  462 (73.1%)
brain (N=910)   684 (90.96%) 576 (91.14%)
kidney (N=190)  141 (18.75%) 128 (20.25%)

OLD ANSWER (does not solve problem at hand, but may be useful for related tasks)

This is one of those things that seems like it should be easy, but somehow isn't.

There is an existing question that addresses this once you have two columns ready to tabulate. That part is easy:

# function to genderate example data
mkdat <- function() factor(sample(letters[1:4], 10, replace=TRUE), levels=letters[1:4])

# make example data
set.seed(10)
dat <- data.frame(id = 1:10, var1 = mkdat(), var2=mkdat(), var3=mkdat())

# use reshape2 package to reshape from wide to long form
library(reshape2)
dat.m <- melt(dat, id.vars="id")
dat.m$value <- factor(dat.m$value)

Now the cross tab of dat.m$variable and dat.m$value give the correct cells. You can refer to the linked question above on how to proceed from there to get both counts and percents in a table, or you can use this method:

# tabulate
library(plyr)
tab <- ddply(dat.m, "variable",
             function(DF) {
               # get counts with table
               count <- table(DF$value)
               # convert counts to percent
               prop <- paste(prop.table(count)*100, "%", sep="")
               # combine count and percent
               cp <- paste(count, " (", prop, ")", sep="")
               # re-attach the names
               names(cp) <- levels(DF$value)
               return(cp)
             })

# get row n
tab.r <- table(dat.m$variable)
# get column n
tab.c <- table(dat.m$value)
# paste row and column n into row and column names
colnames(tab) <- paste(colnames(tab), " (n = ", tab.c, ")", sep="")
rownames(tab) <- paste(tab$variable, " (n = ", tab.r, ")", sep="")
tab$variable <- NULL

# works, but that was way too much effort.
print(tab)

It has to be admitted that this is a lot of work for a simple count-and-proportion table. I'll be delighted if someone comes along with a simpler way to do it.

Upvotes: 1

Related Questions