Reputation: 455
I am trying to use the pivot_longer
function from the dplyr
package to get my data into a long format. The current wide data involves 3 repeated measurements of the patients' age, their systolic blood pressure and if they used blood pressure lowering medication use (med_hypt), and the time invariant 'sex' variable.
The example data and what I've tried:
library(tidyverse)
library(dplyr)
library(magrittr)
wide_data <- structure(list(id = c(12002, 17001, 17002, 42001, 66001, 82002, 166002, 177001, 177002, 240001),
sex = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L),
.Label = c("men", "women"), class = "factor"),
time1_age = c(71.2, 67.9, 66.5, 57.7, 57.1, 60.9, 80.9, 59.7, 58.2, 66.6),
time1_systolicBP = c(102, 152, NA_real_, 170, 151, 135, 162, 133, 131, 117),
time1_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
time2_age = c(74.2, 69.2, 67.8, 58.9, 58.4, 62.5, 82.2, 61, 59.5, 67.8),
time2_systolicBP = c(NA_real_, 146, NA_real_, 151, 129, 129, 137, 144, NA_real_, 132),
time2_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
time3_age = c(78, 74.2, 72.8, 64.1, 63.3, 67.7, 87.1, 66, 64.5, 72.9),
time3_systolicBP = c(NA_real_, 160.5, NA_real_, 171, 135, 160, 151, 166, 129, 150.5),
time3_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
row.names = c(NA, 10L), class = "data.frame")
# Pivoting to a longer format
long_data <- wide_data %>%
pivot_longer(
cols=!id,
names_to=c(".value", "time"),
names_sep="_",
values_drop_na=FALSE
)
This produces the following tibble:
# A tibble: 40 x 6
id time sex time1 time2 time3
<dbl> <chr> <fct> <dbl> <dbl> <dbl>
1 12002 NA women NA NA NA
2 12002 age NA 71.2 74.2 78
3 12002 systolicBP NA 102 NA NA
4 12002 med NA 0 0 0
5 17001 NA men NA NA NA
6 17001 age NA 67.9 69.2 74.2
7 17001 systolicBP NA 152 146 160.
8 17001 med NA 0 0 0
9 17002 NA women NA NA NA
10 17002 age NA 66.5 67.8 72.8
# ... with 30 more rows
What I want is for the column names to be id, time, age, sex, systolicBP, and med_hypt. With 3 rows per patient corresponding to the 3 repeated measurements.
Upvotes: 1
Views: 3888
Reputation: 21938
This probably adds nothing new to the already posted solutions, the only difference is the regex
used for the names_pattern
argument.
_
whereas others are separated by two _
. \\w+
captures any word character, now if I specify we have a number after this with \\d+
as in time3
in time3_age
, we tell pivot_longer
to store this part of the column names corresponding to time3
in time
column. Then the rest of the column names are used for the variable names we are trying to measure line age
, systolicBP
and med_hypt
.\\w+\\d+
instead of \\w+
only the rest will be captured as column names whether it is med_hypt
with underscore or systolicBP
without underscore. But if we use only \\w+
it could also capture med and the resulting column will be hypt
instead of med_hypt
.names_pattern
or names_sep
in a way to specify how each of them are defined and separated.library(dplyr)
wide_data %>%
pivot_longer(!c(id, sex), names_to = c("time", ".value"),
names_pattern = "(\\w+\\d+)_(\\w+)")
# A tibble: 30 x 6
id sex time age systolicBP med_hypt
<dbl> <fct> <chr> <dbl> <dbl> <dbl>
1 12002 women time1 71.2 102 0
2 12002 women time2 74.2 NA 0
3 12002 women time3 78 NA 0
4 17001 men time1 67.9 152 0
5 17001 men time2 69.2 146 0
6 17001 men time3 74.2 160. 0
7 17002 women time1 66.5 NA 0
8 17002 women time2 67.8 NA 0
9 17002 women time3 72.8 NA 0
10 42001 men time1 57.7 170 0
# ... with 20 more rows
Upvotes: 6
Reputation: 389135
Since you have more than one underscore in certain column names it is better to use names_pattern
instead of names_sep
. names_pattern
allows us to pass flexible regex patterns to capture from the column name.
tidyr::pivot_longer(wide_data,
cols=-c(id, sex),
names_to=c("time", ".value"),
names_pattern = '(.*?)_(.*)$',
)
# id sex time age systolicBP med_hypt
# <dbl> <fct> <chr> <dbl> <dbl> <dbl>
# 1 12002 women time1 71.2 102 0
# 2 12002 women time2 74.2 NA 0
# 3 12002 women time3 78 NA 0
# 4 17001 men time1 67.9 152 0
# 5 17001 men time2 69.2 146 0
# 6 17001 men time3 74.2 160. 0
# 7 17002 women time1 66.5 NA 0
# 8 17002 women time2 67.8 NA 0
# 9 17002 women time3 72.8 NA 0
#10 42001 men time1 57.7 170 0
# … with 20 more rows
Upvotes: 5
Reputation: 8880
if I understood correctly
wide_data %>%
pivot_longer(
cols=-c(id, sex),
names_to=c(".value", "time"),
names_sep = "_",
values_drop_na=FALSE
)
# A tibble: 30 x 6
id sex time time1 time2 time3
<dbl> <fct> <chr> <dbl> <dbl> <dbl>
1 12002 women age 71.2 74.2 78
2 12002 women systolicBP 102 NA NA
3 12002 women med 0 0 0
4 17001 men age 67.9 69.2 74.2
5 17001 men systolicBP 152 146 160.
6 17001 men med 0 0 0
7 17002 women age 66.5 67.8 72.8
8 17002 women systolicBP NA NA NA
9 17002 women med 0 0 0
10 42001 men age 57.7 58.9 64.1
Upvotes: 0