Reshape from wide to long with multiple columns that have different naming patterns

Question

I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (ie, time point) of data collection. In the real data, there are > 2 waves. They look like this:

# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)

wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#>   person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1      1           1           1      1        0
#> 2      2           2           2      2        1
#> 3      3           1           1      3        1
#> 4      4           2           2      4        0

I want to reshape from wide to long format so that the data look like this:

# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)

long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#>   person wave sex education q_2_1
#> 1      1    1   1         1    NA
#> 2      1    2   1        NA     0
#> 3      2    1   2         2    NA
#> 4      2    2   2        NA     1
#> 5      3    1   1         3    NA
#> 6      3    2   1        NA     1
#> 7      4    1   2         4    NA
#> 8      4    2   2        NA     0

To reshape the data, I tried pivot_longer(). How do I fix these issues? (I prefer not to use data.table.)

The variables have different naming patterns. (How do I correctly specify the names_pattern argument?)
The values from multiple variables end up in a single column (see how all values are under the 'sex' column).
Filling in NA when a variable was only collected in one wave (ie, if it was only collected in wave 2, I still want wave 1 entries for that variable in which all values are NA).

# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))

# Load package
pacman::p_load(tidyr)

# Reshape from wide to long 
long <- wide %>%
  pivot_longer(
    cols = starts_with('W'),
    names_to = 'Wave',
    names_prefix = 'W',
    names_pattern = '(.*)_',
    values_to = 'sex',
    values_drop_na = TRUE
  )
long
#> # A tibble: 16 × 3
#>    person Wave     sex
#>     <dbl> <chr>  <dbl>
#>  1      1 1_resp     1
#>  2      1 2_resp     1
#>  3      1 1          1
#>  4      1 2_q_2      0
#>  5      2 1_resp     2
#>  6      2 2_resp     2
#>  7      2 1          2
#>  8      2 2_q_2      1
#>  9      3 1_resp     1
#> 10      3 2_resp     1
#> 11      3 1          3
#> 12      3 2_q_2      1
#> 13      4 1_resp     2
#> 14      4 2_resp     2
#> 15      4 1          4
#> 16      4 2_q_2      0

Created on 2022-09-19 by the reprex package (v2.0.1)

akrun · Accepted Answer

We could reshape to 'long' with pivot_longer, specifying names_pattern so that each capture group ((...)) in the column names is matched, in order, to an entry of names_to. That is, the wave column gets the digits (\d+) after the 'W', whereas .value (the names of the new value columns) corresponds to the substring after the first _ in the column names. Then, we could rename resp_sex and edu to sex and education.
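To see what the two capture groups pick up, here is a small base R sketch that applies the same pattern to the example column names with sub(). The vector name cols and the helper column names are only illustrative, and perl = TRUE is used so that \d is handled the same way as in most regex engines.

```r
# Example column names from the question (everything except 'person')
cols <- c("W1_resp_sex", "W2_resp_sex", "W1_edu", "W2_q_2_1")

# Same pattern as in names_pattern; note the doubled backslash in an R string
pattern <- "^W(\\d+)_(.*)$"

# Group 1 feeds 'wave', group 2 feeds '.value' (the new column names)
captured <- data.frame(
  column = cols,
  wave   = sub(pattern, "\\1", cols, perl = TRUE),
  value  = sub(pattern, "\\2", cols, perl = TRUE)
)
captured
```

Because \d+ cannot match an underscore, the first group stops at the wave number, so W2_q_2_1 splits into 2 and q_2_1 rather than being cut at a later underscore. The pivot itself applies this pattern to every column except person: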

library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"), 
    names_pattern = "^W(\\d+)_(.*)$") %>%
   rename_with(~ c("sex", "education"), c("resp_sex", "edu"))

-output

# A tibble: 8 × 5
  person wave    sex education q_2_1
   <dbl> <chr> <dbl>     <dbl> <dbl>
1      1 1         1         1    NA
2      1 2         1        NA     0
3      2 1         2         2    NA
4      2 2         2        NA     1
5      3 1         1         3    NA
6      3 2         1        NA     1
7      4 1         2         4    NA
8      4 2         2        NA     0
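In the output above, wave is a character column, while the goal in the question has a numeric wave. One possible variation, assuming a tidyr version that provides the names_transform argument of pivot_longer, converts it during the pivot. This is a sketch rather than part of the original answer, and it uses rename() instead of rename_with() purely for readability.

```r
library(dplyr)
library(tidyr)

# Wide example data from the question
wide <- data.frame(
  person      = c(1, 2, 3, 4),
  W1_resp_sex = c(1, 2, 1, 2),
  W2_resp_sex = c(1, 2, 1, 2),
  W1_edu      = c(1, 2, 3, 4),
  W2_q_2_1    = c(0, 1, 1, 0)
)

long <- wide %>%
  pivot_longer(
    cols = -person,
    names_to = c("wave", ".value"),
    names_pattern = "^W(\\d+)_(.*)$",
    # Convert the captured wave digits from character to integer
    names_transform = list(wave = as.integer)
  ) %>%
  # new_name = old_name
  rename(sex = resp_sex, education = edu)

long
```

Variables collected in only one wave (edu, q_2_1) are filled with NA for the other wave, because the default values_drop_na = FALSE is left in place.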
