Marissa
Marissa

Reputation: 385

Dynamically create variable names and values using paste in R

I have a dataset with a bunch of survey questions, many of which have subquestions. It looks like this:

enter image description here

structure(list(Q1 = c(1, 2, 2), Q2 = c(2, 3, 1), Q3a = c("a", 
"a", "c"), Q3b = c("b", "b", "d"), Q3c = c("c", "c", "e"), Q4a = c("a", 
"a", "c"), Q4b = c("b", "b", "d"), Q4c = c("c", "c", "e"), Q4d = c("d", 
"d", "f"), Q5 = c("e", "f", "g")), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

Basically for each of the columns that fits the pattern "Q[0-9]+[a-z]", I want to create a new column with that question number, where the values are all values from the subquestions pasted together in a string. I know I can do this manually, but there are a number of questions so if there is a way to do this programmatically I would appreciate it. Desired output is:

enter image description here


structure(list(Q1 = c(1, 2, 2), Q2 = c(2, 3, 1), Q3a = c("a", 
"a", "c"), Q3b = c("b", "b", "d"), Q3c = c("c", "c", "e"), Q4a = c("a", 
"a", "c"), Q4b = c("b", "b", "d"), Q4c = c("c", "c", "e"), Q4d = c("d", 
"d", "f"), Q5 = c("e", "f", "g"), Q3 = c("a, b, c", "a, b, c", 
"c, d, e"), Q4 = c("a, b, c, d", "a, b, c, d", "c, d, e, f")), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 1

Views: 623

Answers (2)

Ben
Ben

Reputation: 30494

With tidyverse you could do the following. After selecting Q1 through Q5 (the columns needed to reproduce the new columns and omitting the already desired Q3 and Q4), you can put the data frame into long format (often preferable for future analyses). Row numbers rn are assigned to track rows, and all columns are made character so can be combined.

When in long format, the items sharing the same number (e.g., Q3a, Q3b, etc.) are grouped and then summarised, putting multiple values of the same group together separated by commas. Then, to get to your desired wide data frame, you can use pivot_wider.

library(tidyverse)

df %>%
  select(Q1:Q5) %>%
  mutate_all(as.character) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, 
               names_to = c("item", "subitem"), 
               names_pattern = "Q(\\d+)(\\w*)") %>%
  group_by(rn, item) %>%
  summarise(value = toString(value)) %>%
  pivot_wider(id_cols = rn, 
              names_from = item, 
              values_from = value, names_prefix = "Q") %>%
  ungroup %>%
  select(-rn)

Output

  Q1    Q2    Q3      Q4         Q5   
  <chr> <chr> <chr>   <chr>      <chr>
1 1     2     a, b, c a, b, c, d e    
2 2     3     a, b, c a, b, c, d f    
3 2     1     c, d, e c, d, e, f g  

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 389175

We can use grep to identify the columns and use it in split.default to create list of similarly named columns. We can paste the values rowwise for each element in list.

cols <- grep("Q[0-9]+[a-z]", names(df), value = TRUE)

cbind(df, sapply(split.default(df[cols], sub('(Q\\d+).*', '\\1', cols)), 
          function(p) do.call(paste, c(p, sep = ','))))

#  Q1 Q2 Q3a Q3b Q3c Q4a Q4b Q4c Q4d Q5    Q3      Q4
#1  1  2   a   b   c   a   b   c   d  e a,b,c a,b,c,d
#2  2  3   a   b   c   a   b   c   d  f a,b,c a,b,c,d
#3  2  1   c   d   e   c   d   e   f  g c,d,e c,d,e,f

Upvotes: 0

Related Questions