Reputation: 7457
I have a df
of the form:
df <- tibble(
id = c(1,2,3),
x02val_a = c(0,1,0),
x03val_a = c(1,0,0),
x04val_a = c(0,1,1),
x02val_b = c(0,2,0),
x03val_b = c(1,3,0),
x04val_b = c(0,1,2),
age02 = c(1,2,3),
age03 = c(2,3,4),
age04 = c(3,4,5)
)
I want to bring it into tidy format like:
# A tibble: 9 x 5
id year val_a val_b age
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 02 0 0 1
2 1 03 1 2 2
...
The answer from here worked for simpler naming schemes. With the naming scheme present in my real dataset, however, I struggle to define a regex that matches all patterns.
My attempts so far all missed one or the other schemes. I can grab the one with the variable name first and the year last (age02
) or the one with the type and year first and the name last (x02var
) but not both at the same time.
Is there a way to do this with a) a regex? or b) some combinations or parameterizations of the pivot_longer
call(s)?
I know there is always the possibility to do it with a left join at the end as I described here
I tried to define the regex with two groups inside each other (since the groups are not strictly serial [meaning: left, right], which led me to):
df %>%
pivot_longer(-id,names_to = c('.value', 'year'),names_pattern = '([a-z]+(\\d+)[a-z]+_[a-z])')
Upvotes: 1
Views: 444
Reputation: 3256
Let's try. It seems this name pattern works:
> df %>%
pivot_longer(-id,
names_to = c('.value', 'year','.value'),
names_pattern = '([a-z]+)(\\d+)([a-z_]*)')
# A tibble: 9 x 5
id year xval_a xval_b age
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 02 0 0 1
2 1 03 1 1 2
3 1 04 0 0 3
4 2 02 1 2 2
5 2 03 0 3 3
6 2 04 1 1 4
7 3 02 0 0 3
8 3 03 0 0 4
9 3 04 1 2 5
Upvotes: 3
Reputation: 16842
It's a bit roundabout, but because of the inconsistent name style, you might first rename your columns to match an easier pattern. There are 3 possible pieces of information in your names, but (at least in your example) each column has only 2 of these.
The relevant pieces are:
"[a-z_]"
, which either occurs after "x" or after the 2 digits. Whichever of these is present will get moved to the beginning of the name; whichever is not present will just not return anything and not take up any space.The parameterization possible with pivot_longer
's ".value"
option gives you column names in just one step based on this cleaner pattern. Should be trivial enough to adjust the pattern as needed, e.g. to fit a different number of digits.
library(dplyr)
library(tidyr)
df %>%
rename_all(stringr::str_replace, "x?([a-z_]*)(\\d{2})([a-z_]*)", "\\1\\3\\2") %>%
pivot_longer(-id, names_to = c(".value", "year"), names_pattern = "([a-z_]+)(\\d{2})")
#> # A tibble: 9 x 5
#> id year val_a val_b age
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 02 0 0 1
#> 2 1 03 1 1 2
#> 3 1 04 0 0 3
#> 4 2 02 1 2 2
Upvotes: 3