Reputation: 385
I have a dataset with a bunch of survey questions, many of which have subquestions. It looks like this:
structure(list(Q1 = c(1, 2, 2), Q2 = c(2, 3, 1), Q3a = c("a",
"a", "c"), Q3b = c("b", "b", "d"), Q3c = c("c", "c", "e"), Q4a = c("a",
"a", "c"), Q4b = c("b", "b", "d"), Q4c = c("c", "c", "e"), Q4d = c("d",
"d", "f"), Q5 = c("e", "f", "g")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
Basically for each of the columns that fits the pattern "Q[0-9]+[a-z]", I want to create a new column with that question number, where the values are all values from the subquestions pasted together in a string. I know I can do this manually, but there are a number of questions so if there is a way to do this programmatically I would appreciate it. Desired output is:
structure(list(Q1 = c(1, 2, 2), Q2 = c(2, 3, 1), Q3a = c("a",
"a", "c"), Q3b = c("b", "b", "d"), Q3c = c("c", "c", "e"), Q4a = c("a",
"a", "c"), Q4b = c("b", "b", "d"), Q4c = c("c", "c", "e"), Q4d = c("d",
"d", "f"), Q5 = c("e", "f", "g"), Q3 = c("a, b, c", "a, b, c",
"c, d, e"), Q4 = c("a, b, c, d", "a, b, c, d", "c, d, e, f")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1
Views: 623
Reputation: 30494
With tidyverse
you could do the following. After selecting Q1 through Q5 (the columns needed to reproduce the new columns and omitting the already desired Q3 and Q4), you can put the data frame into long format (often preferable for future analyses). Row numbers rn
are assigned to track rows, and all columns are made character
so can be combined.
When in long format, the items sharing the same number (e.g., Q3a, Q3b, etc.) are grouped and then summarised, putting multiple values of the same group together separated by commas. Then, to get to your desired wide data frame, you can use pivot_wider
.
library(tidyverse)
df %>%
select(Q1:Q5) %>%
mutate_all(as.character) %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn,
names_to = c("item", "subitem"),
names_pattern = "Q(\\d+)(\\w*)") %>%
group_by(rn, item) %>%
summarise(value = toString(value)) %>%
pivot_wider(id_cols = rn,
names_from = item,
values_from = value, names_prefix = "Q") %>%
ungroup %>%
select(-rn)
Output
Q1 Q2 Q3 Q4 Q5
<chr> <chr> <chr> <chr> <chr>
1 1 2 a, b, c a, b, c, d e
2 2 3 a, b, c a, b, c, d f
3 2 1 c, d, e c, d, e, f g
Upvotes: 2
Reputation: 389175
We can use grep
to identify the columns and use it in split.default
to create list of similarly named columns. We can paste
the values rowwise for each element in list.
cols <- grep("Q[0-9]+[a-z]", names(df), value = TRUE)
cbind(df, sapply(split.default(df[cols], sub('(Q\\d+).*', '\\1', cols)),
function(p) do.call(paste, c(p, sep = ','))))
# Q1 Q2 Q3a Q3b Q3c Q4a Q4b Q4c Q4d Q5 Q3 Q4
#1 1 2 a b c a b c d e a,b,c a,b,c,d
#2 2 3 a b c a b c d f a,b,c a,b,c,d
#3 2 1 c d e c d e f g c,d,e c,d,e,f
Upvotes: 0