Reputation: 31
I have a data frame that contains three variables: treatment, dose, and outcome (plus or minus). I have multiple observations for each treatment and dose. I'm trying to output a contingency table that would collapse the data to indicate the number of each outcome as a function of the treatment and dose, as well as the number of observations. For example:
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1
The desired output would be:
treatment dose outcome n
control 0 0 1 4
treatmentA 1 2 3
treatmentA 2 3 3
I've played around with this all day and haven't had much luck beyond being able to get a frequency for each outcome for each observation. Any suggestions would be appreciated (including pointing out what parts of the R manual and/or examples) i've overlooked.
Thanks!
R
Upvotes: 2
Views: 233
Reputation: 193537
Here are another couple of options (even thought the data.table
approach clearly wins in succinctness of syntax).
The first uses ave
within within
. ave
can apply a function to a variable (the first variable mentioned) grouped by one or more variables. We wrap the output in unique
after dropping the now unnecessary "outcome" column.
unique(within(df, {
SUM <- ave(outcome, treatment, dose, FUN = sum)
COUNT <- ave(outcome, treatment, dose, FUN = length)
rm(outcome)
}))
# treatment dose COUNT SUM
# 1 control 0 4 1
# 5 treatmentA 1 3 2
# 8 treatmentA 2 3 3
A second solution in base R is very similar to @geektrader's answer, except it calculates both sum
and length
in one call to aggregate. There is a "downside" though: the result of that cbind
is a "column" in your data.frame
that is actually a matrix. See the result of str
to see what I mean.
temp <- aggregate(outcome ~ treatment + dose, df,
function(x) cbind(sum(x), length(x)))
str(temp)
# 'data.frame': 3 obs. of 3 variables:
# $ treatment: Factor w/ 2 levels "control","treatmentA": 1 2 2
# $ dose : int 0 1 2
# $ outcome : int [1:3, 1:2] 1 2 3 4 3 3
colnames(temp$outcome) <- c("SUM", "COUNT")
temp
# treatment dose outcome.SUM outcome.COUNT
# 1 control 0 1 4
# 2 treatmentA 1 2 3
# 3 treatmentA 2 3 3
I mention storage structure as a "downside" mostly because you might not get what you expect when you try to access the data in ways you might be accustomed to.
temp$outcome.SUM
# NULL
temp$outcome
# SUM COUNT
# [1,] 1 4
# [2,] 2 3
# [3,] 3 3
Instead, you have to access it via:
temp$outcome[, "SUM"] ## or temp$outcome[, 1]
# [1] 1 2 3
Upvotes: 2
Reputation: 6104
imho, sql is underrated. :)
# read in your example data as `x`
x <- read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1",h=T)
# load the sql data frame library
library(sqldf)
# create a new table of all unique `treatment` and `dose` columns,
# summing the `outcome` column and
# counting the number of records in each combo
y <- sqldf( 'SELECT treatment, dose ,
sum( outcome ) as outcome ,
count(*) as n
FROM x
GROUP BY treatment, dose' )
# check the results
y
Upvotes: 2
Reputation: 17189
If you don't want to use extra libraries as suggested in other answers, you can try following.
> df
treatment dose outcome
1 control 0 0
2 control 0 0
3 control 0 0
4 control 0 1
5 treatmentA 1 0
6 treatmentA 1 1
7 treatmentA 1 1
8 treatmentA 2 1
9 treatmentA 2 1
10 treatmentA 2 1
> dput(df)
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("control", "treatmentA"), class = "factor"),
dose = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), outcome = c(0L,
0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), .Names = c("treatment",
"dose", "outcome"), class = "data.frame", row.names = c(NA, -10L
))
Now we use aggregate
function to get count and sum of outcome
column
> nObs <- aggregate(outcome ~ treatment + dose, data = df, length)
> sObs <- aggregate(outcome ~ treatment + dose, data = df, sum)
Change names of of aggregated column appropriately
names(nObs) <- c('treatment', 'dose', 'count')
> names(sObs) <- c('treatment', 'dose', 'sum')
> nObs
treatment dose count
1 control 0 4
2 treatmentA 1 3
3 treatmentA 2 3
> sObs
treatment dose sum
1 control 0 1
2 treatmentA 1 2
3 treatmentA 2 3
Use merge
to combine above two by all columns by same name treatment
and dose
in this case.
> result <- merge(nObs, sObs)
> result
treatment dose count sum
1 control 0 4 1
2 treatmentA 1 3 2
3 treatmentA 2 3 3
Upvotes: 4
Reputation: 4094
Here is a solution using a wonderful package data.table
:
library(data.table)
x <- data.table(read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE)
x[, list(outcome = sum(outcome), count = .N), by = 'treatment,dose']
produces
treatment dose outcome count
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
Upvotes: 5
Reputation: 2330
If I understand correctly, this is straightforward with the data.table
library. First, load the library and read the data in:
library(data.table)
data <- read.table(header=TRUE, text="
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1")
Next, create a data.table
with the treatment
and dose
columns as the table keys (indices).
data <- data.table(data, key="treatment,dose")
Then aggregate using data.table
syntax.
data[, list(outcome=sum(outcome), n=length(outcome)), by=list(treatment,dose)]
treatment dose outcome n
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
Upvotes: 3