user2020032
user2020032

Reputation: 31

collapsing by groups and calculating n in data.frame using R

I have a data frame that contains three variables: treatment, dose, and outcome (plus or minus). I have multiple observations for each treatment and dose. I'm trying to output a contingency table that would collapse the data to indicate the number of each outcome as a function of the treatment and dose, as well as the number of observations. For example:

treatment dose outcome    
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1 
treatmentA 2 1

The desired output would be:

treatment dose outcome n
control 0 0 1 4
treatmentA 1 2 3
treatmentA 2 3 3

I've played around with this all day and haven't had much luck beyond being able to get a frequency for each outcome for each observation. Any suggestions would be appreciated (including pointing out what parts of the R manual and/or examples) i've overlooked.

Thanks!

R

Upvotes: 2

Views: 233

Answers (5)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193537

Here are another couple of options (even thought the data.table approach clearly wins in succinctness of syntax).

The first uses ave within within. ave can apply a function to a variable (the first variable mentioned) grouped by one or more variables. We wrap the output in unique after dropping the now unnecessary "outcome" column.

unique(within(df, {
  SUM <- ave(outcome, treatment, dose, FUN = sum)
  COUNT <- ave(outcome, treatment, dose, FUN = length)
  rm(outcome)
}))
#    treatment dose COUNT SUM
# 1    control    0     4   1
# 5 treatmentA    1     3   2
# 8 treatmentA    2     3   3

A second solution in base R is very similar to @geektrader's answer, except it calculates both sum and length in one call to aggregate. There is a "downside" though: the result of that cbind is a "column" in your data.frame that is actually a matrix. See the result of str to see what I mean.

temp <- aggregate(outcome ~ treatment + dose, df, 
                  function(x) cbind(sum(x), length(x)))
str(temp)
# 'data.frame':  3 obs. of  3 variables:
#  $ treatment: Factor w/ 2 levels "control","treatmentA": 1 2 2
#  $ dose     : int  0 1 2
#  $ outcome  : int [1:3, 1:2] 1 2 3 4 3 3

colnames(temp$outcome) <- c("SUM", "COUNT")
temp
#    treatment dose outcome.SUM outcome.COUNT
# 1    control    0           1             4
# 2 treatmentA    1           2             3
# 3 treatmentA    2           3             3

I mention storage structure as a "downside" mostly because you might not get what you expect when you try to access the data in ways you might be accustomed to.

temp$outcome.SUM
# NULL

temp$outcome
#      SUM COUNT
# [1,]   1     4
# [2,]   2     3
# [3,]   3     3

Instead, you have to access it via:

temp$outcome[, "SUM"] ## or temp$outcome[, 1]
# [1] 1 2 3

Upvotes: 2

Anthony Damico
Anthony Damico

Reputation: 6104

imho, sql is underrated. :)

# read in your example data as `x`
x <- read.table( text = "treatment dose outcome    
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1 
treatmentA 2 1",h=T)

# load the sql data frame library
library(sqldf)

# create a new table of all unique `treatment` and `dose` columns,
# summing the `outcome` column and 
# counting the number of records in each combo
y <- sqldf( 'SELECT treatment, dose , 
                    sum( outcome ) as outcome , 
                    count(*) as n 
             FROM   x 
             GROUP BY treatment, dose' )

# check the results
y

Upvotes: 2

CHP
CHP

Reputation: 17189

If you don't want to use extra libraries as suggested in other answers, you can try following.

> df
    treatment dose outcome
1     control    0       0
2     control    0       0
3     control    0       0
4     control    0       1
5  treatmentA    1       0
6  treatmentA    1       1
7  treatmentA    1       1
8  treatmentA    2       1
9  treatmentA    2       1
10 treatmentA    2       1

> dput(df)
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L), .Label = c("control", "treatmentA"), class = "factor"), 
    dose = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), outcome = c(0L, 
    0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), .Names = c("treatment", 
"dose", "outcome"), class = "data.frame", row.names = c(NA, -10L
))

Now we use aggregate function to get count and sum of outcome column

> nObs <- aggregate(outcome ~ treatment + dose, data = df, length)
> sObs <- aggregate(outcome ~ treatment + dose, data = df, sum)

Change names of of aggregated column appropriately

names(nObs) <- c('treatment', 'dose', 'count')

> names(sObs) <- c('treatment', 'dose', 'sum')

> nObs
   treatment dose count
1    control    0     4
2 treatmentA    1     3
3 treatmentA    2     3


> sObs
   treatment dose sum
1    control    0   1
2 treatmentA    1   2
3 treatmentA    2   3

Use merge to combine above two by all columns by same name treatment and dose in this case.

> result <- merge(nObs, sObs)
> result
   treatment dose count sum
1    control    0     4   1
2 treatmentA    1     3   2
3 treatmentA    2     3   3

Upvotes: 4

Victor K.
Victor K.

Reputation: 4094

Here is a solution using a wonderful package data.table:

library(data.table)
x <- data.table(read.table( text = "treatment dose outcome    
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1 
treatmentA 2 1", header = TRUE)
x[, list(outcome = sum(outcome), count = .N), by = 'treatment,dose']

produces

    treatment dose outcome count
1:    control    0       1     4
2: treatmentA    1       2     3
3: treatmentA    2       3     3

Upvotes: 5

Jason Morgan
Jason Morgan

Reputation: 2330

If I understand correctly, this is straightforward with the data.table library. First, load the library and read the data in:

library(data.table)

data <- read.table(header=TRUE, text="
treatment dose outcome    
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1 
treatmentA 2 1")

Next, create a data.table with the treatment and dose columns as the table keys (indices).

data <- data.table(data, key="treatment,dose")

Then aggregate using data.table syntax.

data[, list(outcome=sum(outcome), n=length(outcome)), by=list(treatment,dose)]

    treatment dose outcome n
1:    control    0       1 4
2: treatmentA    1       2 3
3: treatmentA    2       3 3

Upvotes: 3

Related Questions