symbolrush
symbolrush

Reputation: 7457

`pivot_longer` operation with different naming schemes

I have a df of the form:

df <- tibble(
  id = c(1,2,3),
  x02val_a = c(0,1,0),
  x03val_a = c(1,0,0),
  x04val_a = c(0,1,1),
  x02val_b = c(0,2,0),
  x03val_b = c(1,3,0),
  x04val_b = c(0,1,2),
  age02 = c(1,2,3),
  age03 = c(2,3,4),
  age04 = c(3,4,5)
)

I want to bring it into tidy format like:

# A tibble: 9 x 5
     id year    val_a   val_b   age
  <dbl> <chr>   <dbl>   <dbl>   <dbl>
1     1 02          0       0       1
2     1 03          1       2       2
...

The answer from here worked for simpler naming schemes. With the naming scheme present in my real dataset, however, I struggle to define a regex that matches all patterns.

My attempts so far all missed one or the other schemes. I can grab the one with the variable name first and the year last (age02) or the one with the type and year first and the name last (x02var) but not both at the same time.

Is there a way to do this with a) a regex? or b) some combinations or parameterizations of the pivot_longer call(s)?

I know there is always the possibility to do it with a left join at the end as I described here


I tried to define the regex with two groups inside each other (since the groups are not strictly serial [meaning: left, right], which led me to):

df %>%
  pivot_longer(-id,names_to = c('.value', 'year'),names_pattern = '([a-z]+(\\d+)[a-z]+_[a-z])')

Upvotes: 1

Views: 444

Answers (2)

iago
iago

Reputation: 3256

Let's try. It seems this name pattern works:

> df %>% 
    pivot_longer(-id,
                  names_to = c('.value', 'year','.value'),
                  names_pattern = '([a-z]+)(\\d+)([a-z_]*)')
# A tibble: 9 x 5
     id year  xval_a xval_b   age
  <dbl> <chr>  <dbl>  <dbl> <dbl>
1     1 02         0      0     1
2     1 03         1      1     2
3     1 04         0      0     3
4     2 02         1      2     2
5     2 03         0      3     3
6     2 04         1      1     4
7     3 02         0      0     3
8     3 03         0      0     4
9     3 04         1      2     5

Upvotes: 3

camille
camille

Reputation: 16842

It's a bit roundabout, but because of the inconsistent name style, you might first rename your columns to match an easier pattern. There are 3 possible pieces of information in your names, but (at least in your example) each column has only 2 of these.

The relevant pieces are:

  • Multiple continuous matches to "[a-z_]", which either occurs after "x" or after the 2 digits. Whichever of these is present will get moved to the beginning of the name; whichever is not present will just not return anything and not take up any space.
  • 2 digits, which get moved to the end.

The parameterization possible with pivot_longer's ".value" option gives you column names in just one step based on this cleaner pattern. Should be trivial enough to adjust the pattern as needed, e.g. to fit a different number of digits.

library(dplyr)
library(tidyr)

df %>%
  rename_all(stringr::str_replace, "x?([a-z_]*)(\\d{2})([a-z_]*)", "\\1\\3\\2") %>%
  pivot_longer(-id, names_to = c(".value", "year"), names_pattern = "([a-z_]+)(\\d{2})")
#> # A tibble: 9 x 5
#>      id year  val_a val_b   age
#>   <dbl> <chr> <dbl> <dbl> <dbl>
#> 1     1 02        0     0     1
#> 2     1 03        1     1     2
#> 3     1 04        0     0     3
#> 4     2 02        1     2     2

Upvotes: 3

Related Questions