Reputation: 163
I have been using gather() from the tidyr R package to tidy my survey data.
Is there a good way to deal with multiple-choice questions when tidying data?
This question is not about a specific error, but about which strategy fits best.
Imagine the following tibble:
library(tidyverse)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",   1,  1, NA, NA, 1, "No",
               "Jess",   NA,  1,  1,  1, 1, "Yes",
               "George", NA, NA,  1, NA, 1, "No")
When gathering these multiple-choice results, I (logically) get multiple rows for 'Harry', 'Jess' and 'George':
tb1 %>%
  gather(X, val, x1:x3, -id, -z) %>%
  filter(!is.na(val)) %>%
  select(-val) %>%
  gather(Y, val, y1:y2, -id, -X, -z) %>%
  filter(!is.na(val)) %>%
  select(-val)
# A tibble: 7 x 4
  id     z     X     Y
  <chr>  <chr> <chr> <chr>
1 Jess   Yes   x2    y1
2 Jess   Yes   x3    y1
3 Harry  No    x1    y2
4 Harry  No    x2    y2
5 Jess   Yes   x2    y2
6 Jess   Yes   x3    y2
7 George No    x3    y2
I'm a bit worried about the multiple entries, and was wondering whether there's a good strategy for survey questions whose multiple-choice answers are stored as binary columns that need to be gathered.
In the end, I'd like to be able to plot and analyse the values of various variables, e.g. the number of times that people selected y2.
This long format does not seem practical for that, because count() counts a person's y2 once for every X option they selected: Harry's single y2 answer shows up twice, once with x1 and once with x2.
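To make the double counting concrete, here is a minimal sketch, assuming the tidyverse is loaded and tb1 is defined as above. The names long, row_counts and person_counts are only illustrative. Counting rows inflates y2, while counting distinct ids per option does not:

```r
library(tidyverse)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",   1,  1, NA, NA, 1, "No",
               "Jess",   NA,  1,  1,  1, 1, "Yes",
               "George", NA, NA,  1, NA, 1, "No")

# Same double gather as in the question; `long` is an illustrative name
long <- tb1 %>%
  gather(X, val, x1:x3) %>%
  filter(!is.na(val)) %>%
  select(-val) %>%
  gather(Y, val, y1:y2) %>%
  filter(!is.na(val)) %>%
  select(-val)

# Row counts are inflated by the X answers: y2 is counted once
# for every (X, Y) combination a person has
row_counts <- count(long, Y)

# Counting distinct respondents per Y option avoids the inflation
person_counts <- long %>%
  group_by(Y) %>%
  summarise(n = n_distinct(id))
```

For the example data, row_counts reports 5 for y2, whereas person_counts reports 3, which is the number of people who actually selected it.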
The flow of questions I have regarding this topic is as follows:
Upvotes: 0
Views: 1299
Reputation: 6441
I think the easiest way is definitely to gather all the responses into one column.
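library(tidyverse)

# Turn z into indicator columns (z_No, z_Yes), then stack every question
# column into one long column, dropping the options that were not selected (NA)
reshape_tb1 <- tb1 %>%
  spread(key = z, value = z, sep = "_") %>%
  gather(key = "Question", value = "Answer", -id, na.rm = TRUE) %>%
  select(-Answer)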
> reshape_tb1
# A tibble: 12 x 2
   id     Question
   <chr>  <chr>
 1 Harry  x1
 2 Harry  x2
 3 Jess   x2
 4 George x3
 5 Jess   x3
 6 Jess   y1
 7 George y2
 8 Harry  y2
 9 Jess   y2
10 George z_No
11 Harry  z_No
12 Jess   z_Yes
This way you can easily feed it to ggplot2:
ggplot(reshape_tb1) +
  geom_bar(aes(x = Question))
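If you need the numbers rather than a plot, the same long table can be tallied with count(). Each id appears at most once per Question, so nobody is counted twice. A minimal sketch, rebuilding the data so it runs on its own; counts and n_y2 are illustrative names:

```r
library(tidyverse)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",   1,  1, NA, NA, 1, "No",
               "Jess",   NA,  1,  1,  1, 1, "Yes",
               "George", NA, NA,  1, NA, 1, "No")

# Same reshaping as above: one row per (id, selected option)
reshape_tb1 <- tb1 %>%
  spread(key = z, value = z, sep = "_") %>%
  gather(key = "Question", value = "Answer", -id, na.rm = TRUE) %>%
  select(-Answer)

# Number of respondents who selected each option
counts <- count(reshape_tb1, Question)

# Look up a single option, e.g. how many people picked y2
n_y2 <- counts$n[counts$Question == "y2"]
```

For the example data, n_y2 is 3: Harry, Jess and George each selected y2 once.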
Upvotes: 1