Dynamically create variable names and values using paste in R

Question

I have a dataset with a bunch of survey questions, many of which have subquestions. It looks like this:

structure(list(Q1 = c(1, 2, 2), Q2 = c(2, 3, 1), Q3a = c("a", 
"a", "c"), Q3b = c("b", "b", "d"), Q3c = c("c", "c", "e"), Q4a = c("a", 
"a", "c"), Q4b = c("b", "b", "d"), Q4c = c("c", "c", "e"), Q4d = c("d", 
"d", "f"), Q5 = c("e", "f", "g")), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

Basically for each of the columns that fits the pattern "Q[0-9]+[a-z]", I want to create a new column with that question number, where the values are all values from the subquestions pasted together in a string. I know I can do this manually, but there are a number of questions so if there is a way to do this programmatically I would appreciate it. Desired output is:


structure(list(Q1 = c(1, 2, 2), Q2 = c(2, 3, 1), Q3a = c("a", 
"a", "c"), Q3b = c("b", "b", "d"), Q3c = c("c", "c", "e"), Q4a = c("a", 
"a", "c"), Q4b = c("b", "b", "d"), Q4c = c("c", "c", "e"), Q4d = c("d", 
"d", "f"), Q5 = c("e", "f", "g"), Q3 = c("a, b, c", "a, b, c", 
"c, d, e"), Q4 = c("a, b, c, d", "a, b, c, d", "c, d, e, f")), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

Ben · Accepted Answer

With tidyverse you could do the following. After selecting Q1 through Q5 (the columns needed to reproduce the new columns and omitting the already desired Q3 and Q4), you can put the data frame into long format (often preferable for future analyses). Row numbers rn are assigned to track rows, and all columns are made character so can be combined.

When in long format, the items sharing the same number (e.g., Q3a, Q3b, etc.) are grouped and then summarised, putting multiple values of the same group together separated by commas. Then, to get to your desired wide data frame, you can use pivot_wider.

library(tidyverse)

df %>%
  select(Q1:Q5) %>%
  mutate_all(as.character) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, 
               names_to = c("item", "subitem"), 
               names_pattern = "Q(\d+)(\w*)") %>%
  group_by(rn, item) %>%
  summarise(value = toString(value)) %>%
  pivot_wider(id_cols = rn, 
              names_from = item, 
              values_from = value, names_prefix = "Q") %>%
  ungroup %>%
  select(-rn)

Output

  Q1    Q2    Q3      Q4         Q5   
             
1 1     2     a, b, c a, b, c, d e    
2 2     3     a, b, c a, b, c, d f    
3 2     1     c, d, e c, d, e, f g

Dynamically create variable names and values using paste in R

Answers (2)

Related Questions