Reputation: 1538
I have a tibble with a column containing strings in the format XX_YY_ZZ. I need to separate the column into two different columns using the first underscore as a separation character.
The function tidyr::separate will match both underscores and discard the piece of string after the second one. Googling around I've found the regex "(?!.*)" that matches only the last underscore, but I've not been able to find one matching only the first one.
My data look like:
d <- tibble(var = paste0(LETTERS[1:5], "_", 1:5, "_", letters[1:5]))
d %>% separate(var, into=c("newcol1", "newcol2"))
# A tibble: 5 x 2
newcol1 newcol2
<chr> <chr>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
Warning message:
Expected 2 pieces. Additional pieces discarded in 5 rows [1, 2, 3, 4, 5].
What I need to have is:
# A tibble: 5 x 2
newcol1 newcol2
<chr> <chr>
1 A 1_a
2 B 2_b
3 C 3_c
4 D 4_d
5 E 5_e
Upvotes: 0
Views: 1427
Reputation: 269694
1) Specify extra = "merge"
d %>%
separate(var, into=c("newcol1", "newcol2"), extra = "merge", sep = "_")
2) or replace the first underscore with two underscores and then split on that.
d %>%
mutate(var = sub("_", "__", var)) %>%
separate(var, into=c("newcol1", "newcol2"), sep = "__")
3) or just use mutate
and sub
instead of separate
:
d %>%
mutate(newcol1 = sub("_.*", "", var), newcol2 = sub(".*?_", "", var), var = NULL)
Upvotes: 2
Reputation: 81693
Here is a way to solve the problem.
d %>% separate(var, into = c("newcol1", "newcol2"), sep = "_(?=.*_)")
Here, the regex _(?=.*_)
means: _
followed by a string including another _
.
The result:
# A tibble: 5 x 2
newcol1 newcol2
<chr> <chr>
1 A 1_a
2 B 2_b
3 C 3_c
4 D 4_d
5 E 5_e
Upvotes: 1