Reputation: 2693
I have a table where all the values are categorical. It looks like this:
sample region question1 question2
1 reg1 yes yes
2 reg2 yes maybe
3 reg3 yes maybe
4 reg3 no yes
Is there an easy way to aggregate the data to count how many people answered yes and no to a question without writing loops?
My goals is get something like:
question
yes no
reg1 15 20
reg2 30 11
etc.
I've checked reshape2 package but it seems that it does not do what I need.
Upvotes: 3
Views: 673
Reputation: 226732
The simplest answer seems to be table()
.
Make up data:
dd <- read.table(text="
sample region question1 question2
1 reg1 yes yes
2 reg2 yes maybe
3 reg3 yes maybe
4 reg3 no yes",
header=TRUE)
Analyze:
with(dd,table(region,question1))
## question1
## region no yes
## reg1 0 1
## reg2 0 1
## reg3 1 1
This works because question 1 has no "maybe" responses. If you want to leave them out, you could:
dd2 <- subset(dd,question1 %in% c("no","yes"))
with(dd2,table(...))
(with()
is just a convenience; you could type table(dd$region,dd$question1)
instead.)
For more complex questions you may well want to look into reshape2
/plyr
or tidyr
/dplyr
...
For example, to summarize results for all questions simultaneously:
library("reshape2")
ddm <- melt(dd,id.vars=c("sample","region"))
tt <- with(ddm,table(region,value,variable))
This gives us a 3x3x2 table (a regionxvalue table for each question). If we want it in long format:
ttm <- melt(tt,value.name="count")
res <- dcast(ttm,region+variable~value,value.var="count")
## region variable maybe no yes
## 1 reg1 question1 0 0 1
## 2 reg1 question2 0 0 1
## 3 reg2 question1 0 0 1
## 4 reg2 question2 1 0 0
## 5 reg3 question1 0 1 1
## 6 reg3 question2 1 0 1
Upvotes: 4
Reputation: 13903
I'm assuming your data is in a data.frame
like this:
# build the sample data
sample_data <- data.frame(
sample = 1:4,
region = c("reg1", "reg2", "reg3", "reg3"),
question1 = c("yes", "yes", "yes", "no"),
question2 = c("yes", "maybe", "maybe", "yes"),
stringsAsFactors = TRUE
)
# get the variable names you want to summarize
question_vars <- grep("question", names(sample_data), value = TRUE)
and that you want to aggregate by region
.
There's no reason to specifically want to avoid looping here. However you can do the aggregation for each variable in one line:
# pre-allocate the list. not strictly necessary here, and you could
# just write `tables_by_region <- list()`, but preallocation is good practice,
# especially when the list is very long
tables_by_region <- vector("list", length(question_vars))
names(tables_by_region) <- question_vars
# loop over question vars
for (question_var in question_vars) {
tables_by_region[[question_var]] <- aggregate(sample_data[[question_var]],
sample_data["region"], table)
}
# be careful with your `[`s and `"`s!
You can, however, shorten this code with an advanced use of lapply
:
tables_by_region <- lapply(sample_data[question_vars], aggregate,
sample_data["region"], table)
Upvotes: 1