Claudio
Claudio

Reputation: 1538

Specify separator character in "separate" function from package tidyr

I have a tibble with a column containing strings in the format XX_YY_ZZ. I need to separate the column into two different columns using the first underscore as a separation character.

The function tidyr::separate will match both underscores and discard the piece of string after the second one. Googling around I've found the regex "(?!.*)" that matches only the last underscore, but I've not been able to find one matching only the first one.

My data look like:

d <- tibble(var = paste0(LETTERS[1:5], "_", 1:5, "_", letters[1:5])) 
d %>% separate(var, into=c("newcol1", "newcol2"))

# A tibble: 5 x 2
  newcol1 newcol2
  <chr>   <chr>  
1 A       1      
2 B       2      
3 C       3      
4 D       4      
5 E       5      
Warning message:
Expected 2 pieces. Additional pieces discarded in 5 rows [1, 2, 3, 4, 5]. 

What I need to have is:

# A tibble: 5 x 2
  newcol1 newcol2
  <chr>   <chr>  
1 A       1_a
2 B       2_b
3 C       3_c
4 D       4_d
5 E       5_e

Upvotes: 0

Views: 1427

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269694

1) Specify extra = "merge"

d %>% 
  separate(var, into=c("newcol1", "newcol2"), extra = "merge", sep = "_")

2) or replace the first underscore with two underscores and then split on that.

d %>% 
  mutate(var = sub("_", "__", var)) %>% 
  separate(var, into=c("newcol1", "newcol2"), sep = "__")

3) or just use mutate and sub instead of separate:

d %>%
  mutate(newcol1 = sub("_.*", "", var), newcol2 = sub(".*?_", "", var), var = NULL)

Upvotes: 2

Sven Hohenstein
Sven Hohenstein

Reputation: 81693

Here is a way to solve the problem.

d %>% separate(var, into = c("newcol1", "newcol2"), sep = "_(?=.*_)")

Here, the regex _(?=.*_) means: _ followed by a string including another _.

The result:

# A tibble: 5 x 2
  newcol1 newcol2
  <chr>   <chr>  
1 A       1_a    
2 B       2_b    
3 C       3_c    
4 D       4_d    
5 E       5_e   

Upvotes: 1

Related Questions