Reputation: 10996

Frequency table for multiselect survey question across several columns

I want to do a fairly common analysis of survey questions in R, but am stuck in the middle.

Imagine a survey where you are asked to answer which brands do you associate with certain features (e.g. "brands" could be PlayStation, XBox..., and features could be "speed", "graphics"... where each brand can be checked on several features aka mulit-select). E.g. sth. like this here: https://www.harvestyourdata.com/fileadmin/images/question-type-screenshots/Grid-multi-select.jpg

You often refer to these questions as multi-select grid or matrix questions.

Anyway, from a data perspective, this kind of data is usually stored in wide format where each row*column combination is one variable, which is 0/1 coded (0 if the survey participant doesn't check the box, 1 otherwise).

Assuming we have 5 brands and 10 items, we would have 50 variables in total, ideally following a nice, structured naming scheme, e.g. item1_column1, item2_column1, item3_column1, [...], item1_column2 and so on.

Now, I want to analyze (frequency table) all of these variables in one iteration. I've already found the cross.multi.table function in the questionr package. However, it only allows to analyze all items based on on single factor. What I need instead is to allow for several columns at the same time.

Any ideas? MIght be I'm missing a function from another package or this can easily be done with tidyverse or even with the cross.multi.table function?

Using this data as test input:

dat = data.frame(item1_column1 = c(0,1,1,1),
                 item2_column1 = c(1,1,1,0),
                 item3_column1 = c(0,0,1,1),
                 item1_column2 = c(1,1,1,0),
                 item2_column2 = c(0,1,1,1),
                 item3_column2 = c(1,0,1,1),
                 item1_column3 = c(0,1,1,0),
                 item2_column3 = c(1,1,1,1),
                 item3_column3 = c(0,0,1,0))

I'd expect this output:

         column1    column2    column3
item1    3          3          2
item2    3          3          4
item3    2          3          1

or ideally as proportions/percentages:

         column1    column2    column3
item1    75%        75%        50%
item2    75%        75%        100%
item3    50%        75%        25%

Upvotes: 2

Answers (3)

akrun

Reputation: 887048

We can do this in base R, by creating a two column data.frame with the column names replicated, cbind with the unlisted values, and use xtabs to get the sum while pivoting to 'wide' format

out <- xtabs(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
    sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)))

out
#    V2
#V1      column1 column2 column3
#  item1       3       3       2
#  item2       3       3       4
#  item3       2       3       1

Or as @GKi mentioned (a compact version would be) to split the column names by _, create a data.frame with that along with colSums (or colMeans - for percentage) and use xtabs for pivoting

xtabs(n ~ ., data.frame(do.call("rbind", 
              strsplit(colnames(dat), "_")), n=colSums(dat)))

Or to get the percentage

xtabs(val ~ ., aggregate(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
   sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)), mean)) * 100
#     V2
#V1      column1 column2 column3
#  item1      75      75      50
#  item2      75      75     100
#  item3      50      75      25

Or inspired from @GKi, using enframe

library(dplyr)
library(tidyr)
library(tibble)
enframe(colSums(dat)) %>% 
   separate(name, into = c('name1', 'name2')) %>% 
   spread(name2, value)
# A tibble: 3 x 4
#  name1 column1 column2 column3
#  <chr>   <dbl>   <dbl>   <dbl>
#1 item1       3       3       2
#2 item2       3       3       4
#3 item3       2       3       1

To get the percentage, just change the first line of code to

enframe(100 *colMeans(dat))

Upvotes: 1

M--

Reputation: 28825

What I do here, by using data.table package, is summarizing each column, converting data to long format, breaking a column to two (item and column), and finally converting to wide format. Look below;

library(data.table)

dcast(setDT(melt(setDT(dat)[,100*colMeans(.SD),]),keep.rownames = T)[,
             c("item", "column") := tstrsplit(rn, "_", fixed=TRUE)],
                item ~ column, value.var = "value")

#>     item column1 column2 column3
#> 1: item1      75      75      50
#> 2: item2      75      75     100
#> 3: item3      50      75      25

Upvotes: 1

Ronak Shah

Reputation: 388907

One way could be to get data into long format using gather, separate columns based on _, group_by item and column and calculate the ratio of value column and spread the data to wide format.

library(dplyr)
library(tidyr)

dat %>%
  gather(key, value) %>%
  separate(key, into = c("item", "column"), sep = "_") %>%
  group_by(item, column) %>%
  summarise(prop = mean(value) * 100) %>%
  spread(column, prop)

#  item  column1 column2 column3
#  <chr>   <dbl>   <dbl>   <dbl>
#1 item1      75      75      50
#2 item2      75      75     100
#3 item3      50      75      25

A bit shorter (Thanks to @M-M)

dat %>%
  summarise_all(~mean(.) * 100) %>%
  gather(key, value) %>%
  separate(key, into = c("item", "column"), sep = "_") %>%
  spread(column, value)

Upvotes: 5

Frequency table for multiselect survey question across several columns

Answers (3)

Related Questions