YKY
YKY

Reputation: 2693

Creating pivot table for fully categorical dataset in R

I have a table where all the values are categorical. It looks like this:

sample region question1 question2
1 reg1 yes yes
2 reg2 yes maybe
3 reg3 yes maybe
4 reg3 no yes

Is there an easy way to aggregate the data to count how many people answered yes and no to a question without writing loops?

My goals is get something like:

question
    yes no
reg1 15 20
reg2 30 11

etc.

I've checked reshape2 package but it seems that it does not do what I need.

Upvotes: 3

Views: 673

Answers (2)

Ben Bolker
Ben Bolker

Reputation: 226732

The simplest answer seems to be table().

Make up data:

dd <- read.table(text="
sample region question1 question2
1 reg1 yes yes
2 reg2 yes maybe
3 reg3 yes maybe
4 reg3 no yes",
header=TRUE)

Analyze:

with(dd,table(region,question1))
##       question1
## region no yes
##   reg1  0   1
##   reg2  0   1
##   reg3  1   1

This works because question 1 has no "maybe" responses. If you want to leave them out, you could:

dd2 <- subset(dd,question1 %in% c("no","yes"))
with(dd2,table(...))

(with() is just a convenience; you could type table(dd$region,dd$question1) instead.)

For more complex questions you may well want to look into reshape2/plyr or tidyr/dplyr ...

For example, to summarize results for all questions simultaneously:

library("reshape2")
ddm <- melt(dd,id.vars=c("sample","region"))
tt <- with(ddm,table(region,value,variable))

This gives us a 3x3x2 table (a regionxvalue table for each question). If we want it in long format:

ttm <- melt(tt,value.name="count")
res <- dcast(ttm,region+variable~value,value.var="count")
##   region  variable maybe no yes
## 1   reg1 question1     0  0   1
## 2   reg1 question2     0  0   1
## 3   reg2 question1     0  0   1
## 4   reg2 question2     1  0   0
## 5   reg3 question1     0  1   1
## 6   reg3 question2     1  0   1

Upvotes: 4

shadowtalker
shadowtalker

Reputation: 13903

I'm assuming your data is in a data.frame like this:

# build the sample data
sample_data <- data.frame(
  sample = 1:4,
  region = c("reg1", "reg2", "reg3", "reg3"),
  question1 = c("yes", "yes", "yes", "no"),
  question2 = c("yes", "maybe", "maybe", "yes"),
  stringsAsFactors = TRUE
)

# get the variable names you want to summarize
question_vars <- grep("question", names(sample_data), value = TRUE)

and that you want to aggregate by region.

There's no reason to specifically want to avoid looping here. However you can do the aggregation for each variable in one line:

# pre-allocate the list. not strictly necessary here, and you could
#   just write `tables_by_region <- list()`, but preallocation is good practice,
#   especially when the list is very long
tables_by_region <- vector("list", length(question_vars))
names(tables_by_region) <- question_vars

# loop over question vars 
for (question_var in question_vars) {
  tables_by_region[[question_var]] <- aggregate(sample_data[[question_var]],
                                                sample_data["region"], table)
}
# be careful with your `[`s and `"`s!

You can, however, shorten this code with an advanced use of lapply:

tables_by_region <- lapply(sample_data[question_vars], aggregate,
                           sample_data["region"], table)

Upvotes: 1

Related Questions