tcvdb1992
tcvdb1992

Reputation: 455

Pivot_longer for multiple columns of repeated measurements data

I am trying to use the pivot_longer function from the dplyr package to get my data into a long format. The current wide data involves 3 repeated measurements of the patients' age, their systolic blood pressure and if they used blood pressure lowering medication use (med_hypt), and the time invariant 'sex' variable.

The example data and what I've tried:

library(tidyverse)
library(dplyr)
library(magrittr)

wide_data <- structure(list(id = c(12002, 17001, 17002, 42001, 66001, 82002, 166002, 177001, 177002, 240001), 
                            sex = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), 
                                            .Label = c("men", "women"), class = "factor"), 
                            time1_age = c(71.2, 67.9, 66.5, 57.7, 57.1, 60.9, 80.9, 59.7, 58.2, 66.6), 
                            time1_systolicBP = c(102, 152, NA_real_, 170, 151, 135, 162, 133, 131, 117), 
                            time1_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                            time2_age = c(74.2, 69.2, 67.8, 58.9, 58.4, 62.5, 82.2, 61, 59.5, 67.8), 
                            time2_systolicBP = c(NA_real_, 146, NA_real_, 151, 129, 129, 137, 144, NA_real_, 132), 
                            time2_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                            time3_age = c(78, 74.2, 72.8, 64.1, 63.3, 67.7, 87.1, 66, 64.5, 72.9), 
                            time3_systolicBP = c(NA_real_, 160.5, NA_real_, 171, 135, 160, 151, 166, 129, 150.5), 
                            time3_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), 
                       row.names = c(NA, 10L), class = "data.frame")

# Pivoting to a longer format
long_data <- wide_data %>% 
  pivot_longer(
    cols=!id,
    names_to=c(".value", "time"), 
    names_sep="_", 
    values_drop_na=FALSE
  )

This produces the following tibble:

# A tibble: 40 x 6
      id time       sex   time1 time2 time3
   <dbl> <chr>      <fct> <dbl> <dbl> <dbl>
 1 12002 NA         women  NA    NA    NA  
 2 12002 age        NA     71.2  74.2  78  
 3 12002 systolicBP NA    102    NA    NA  
 4 12002 med        NA      0     0     0  
 5 17001 NA         men    NA    NA    NA  
 6 17001 age        NA     67.9  69.2  74.2
 7 17001 systolicBP NA    152   146   160. 
 8 17001 med        NA      0     0     0  
 9 17002 NA         women  NA    NA    NA  
10 17002 age        NA     66.5  67.8  72.8
# ... with 30 more rows

What I want is for the column names to be id, time, age, sex, systolicBP, and med_hypt. With 3 rows per patient corresponding to the 3 repeated measurements.

Upvotes: 1

Views: 3888

Answers (3)

Anoushiravan R
Anoushiravan R

Reputation: 21938

This probably adds nothing new to the already posted solutions, the only difference is the regex used for the names_pattern argument.

  • If you notice some of your column names are separated by one _ whereas others are separated by two _. \\w+ captures any word character, now if I specify we have a number after this with \\d+ as in time3 in time3_age, we tell pivot_longer to store this part of the column names corresponding to time3 in time column. Then the rest of the column names are used for the variable names we are trying to measure line age, systolicBP and med_hypt.
  • It should be noted that if we use \\w+\\d+ instead of \\w+ only the rest will be captured as column names whether it is med_hypt with underscore or systolicBP without underscore. But if we use only \\w+ it could also capture med and the resulting column will be hypt instead of med_hypt.
  • In the end since I defined two capture groups, I have to define either names_pattern or names_sep in a way to specify how each of them are defined and separated.
library(dplyr)

wide_data %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"), 
               names_pattern = "(\\w+\\d+)_(\\w+)")

# A tibble: 30 x 6
      id sex   time    age systolicBP med_hypt
   <dbl> <fct> <chr> <dbl>      <dbl>    <dbl>
 1 12002 women time1  71.2       102         0
 2 12002 women time2  74.2        NA         0
 3 12002 women time3  78          NA         0
 4 17001 men   time1  67.9       152         0
 5 17001 men   time2  69.2       146         0
 6 17001 men   time3  74.2       160.        0
 7 17002 women time1  66.5        NA         0
 8 17002 women time2  67.8        NA         0
 9 17002 women time3  72.8        NA         0
10 42001 men   time1  57.7       170         0
# ... with 20 more rows

Upvotes: 6

Ronak Shah
Ronak Shah

Reputation: 389135

Since you have more than one underscore in certain column names it is better to use names_pattern instead of names_sep. names_pattern allows us to pass flexible regex patterns to capture from the column name.

tidyr::pivot_longer(wide_data, 
    cols=-c(id, sex),
    names_to=c("time", ".value"), 
    names_pattern = '(.*?)_(.*)$', 
  )

#      id sex   time    age systolicBP med_hypt
#   <dbl> <fct> <chr> <dbl>      <dbl>    <dbl>
# 1 12002 women time1  71.2       102         0
# 2 12002 women time2  74.2        NA         0
# 3 12002 women time3  78          NA         0
# 4 17001 men   time1  67.9       152         0
# 5 17001 men   time2  69.2       146         0
# 6 17001 men   time3  74.2       160.        0
# 7 17002 women time1  66.5        NA         0
# 8 17002 women time2  67.8        NA         0
# 9 17002 women time3  72.8        NA         0
#10 42001 men   time1  57.7       170         0
# … with 20 more rows

Upvotes: 5

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

if I understood correctly

   wide_data %>% 
      pivot_longer(
        cols=-c(id, sex),
        names_to=c(".value", "time"), 
        names_sep = "_", 
        values_drop_na=FALSE
      )

# A tibble: 30 x 6
      id sex   time       time1 time2 time3
   <dbl> <fct> <chr>      <dbl> <dbl> <dbl>
 1 12002 women age         71.2  74.2  78  
 2 12002 women systolicBP 102    NA    NA  
 3 12002 women med          0     0     0  
 4 17001 men   age         67.9  69.2  74.2
 5 17001 men   systolicBP 152   146   160. 
 6 17001 men   med          0     0     0  
 7 17002 women age         66.5  67.8  72.8
 8 17002 women systolicBP  NA    NA    NA  
 9 17002 women med          0     0     0  
10 42001 men   age         57.7  58.9  64.1

Upvotes: 0

Related Questions