Stuart
Stuart

Reputation: 607

tidyr::pivot_longer with complex names_pattern

I have a data frame called all.data that looks like this:

 A tibble: 10 x 5
   survey_id eb.estimate.2021 se.rescaled.2021 eb.estimate.2022 se.rescaled.2022
       <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1       606             4.93            1.11             5.70             1.23 
 2       610             7.27            0.696            7.35             0.579
 3       571             2.37            0.636            0.345            0.998
 4       590             6.00            0.659            6.36             0.791
 5       616             2.79            0.825            5.08             1.17 
 6       617             3.11            0.633            1.68             0.779
 7       596             5.97            0.857            5.36             0.994
 8       581             5.26            0.975            4.79             0.735
 9       577             7.11            0.903            5.52             0.999
10       598             5.10            0.650            4.71             0.776

I want to pivot_longer so it looks like this:

 A tibble: 10 x 4
   survey_id year     se eb.est
       <dbl> <chr> <dbl>  <dbl>
 1       606 2021  1.11   4.93 
 2       606 2022  1.23   5.70 
 3       610 2021  0.696  7.27 
 4       610 2022  0.579  7.35 
 5       571 2021  0.636  2.37 
 6       571 2022  0.998  0.345
 7       590 2021  0.659  6.00 
 8       590 2022  0.791  6.36 
 9       616 2021  0.825  2.79 
10       616 2022  1.17   5.08 

I was able to do it by doing pivot_longer twice: once for the eb.estimates and once for the se.rescaleds and then merging them by year and survey_id. One of these pivot_longers is like this:

foo1 <- all.data %>%
    pivot_longer(cols=c(se.rescaled.2021, se.rescaled.2022),  names_to="year",
                 names_pattern="(?:se\\.rescaled\\.)(\\d{4})",
                 values_to="se") %>%
    mutate(se=as.numeric(se))

This method offends my aesthetic and efficiency senses. There should be a way to do it in one pivot_longer but I'm not having success. I tried this:

all.data.long2 <- all.data %>%
    pivot_longer(cols=c("eb.estimate.2021", "eb.estimate.2022", "se.rescaled.2021", "se.rescaled.2022"),
                 names_to= ".value",
                 names_pattern="((eb\\.estimate|se\\.rescaled))(?:\\.)(\\d{4})")

But this gives me an error message: Error: regex should define 1 groups; 3 found. (Actually, I don't understand why it found 3 groups; there are only 2 capturing groups in that regex.) What I thought would be logical is to be able to select the matched segments with an index as we do with a function like str_match(), sort of like this:

names_to=names_pattern[3]
values_to=names_pattern[1]

Is there some way to do this? Thanks in advance

Upvotes: 1

Views: 143

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388982

You may use -

tidyr::pivot_longer(all.data, cols = -survey_id, 
                    names_to = c('.value', 'year'),
                    names_pattern = '(eb.est|se).*\\.(\\d+)')

#   survey_id year  eb.est    se
#       <int> <chr>  <dbl> <dbl>
# 1       606 2021   4.93  1.11 
# 2       606 2022   5.7   1.23 
# 3       610 2021   7.27  0.696
# 4       610 2022   7.35  0.579
# 5       571 2021   2.37  0.636
# 6       571 2022   0.345 0.998
# 7       590 2021   6     0.659
# 8       590 2022   6.36  0.791
# 9       616 2021   2.79  0.825
#10       616 2022   5.08  1.17 
#11       617 2021   3.11  0.633
#12       617 2022   1.68  0.779
#13       596 2021   5.97  0.857
#14       596 2022   5.36  0.994
#15       581 2021   5.26  0.975
#16       581 2022   4.79  0.735
#17       577 2021   7.11  0.903
#18       577 2022   5.52  0.999
#19       598 2021   5.1   0.65 
#20       598 2022   4.71  0.776

.value corresponds to the pattern (eb.est|se) and the number (\\d+) is saved in a new column called year.

data

all.data <- structure(list(survey_id = c(606L, 610L, 571L, 590L, 616L, 617L, 
596L, 581L, 577L, 598L), eb.estimate.2021 = c(4.93, 7.27, 2.37, 
6, 2.79, 3.11, 5.97, 5.26, 7.11, 5.1), se.rescaled.2021 = c(1.11, 
0.696, 0.636, 0.659, 0.825, 0.633, 0.857, 0.975, 0.903, 0.65), 
    eb.estimate.2022 = c(5.7, 7.35, 0.345, 6.36, 5.08, 1.68, 
    5.36, 4.79, 5.52, 4.71), se.rescaled.2022 = c(1.23, 0.579, 
    0.998, 0.791, 1.17, 0.779, 0.994, 0.735, 0.999, 0.776)), 
class = "data.frame", row.names = c(NA, -10L))

Upvotes: 2

Related Questions