Separate column with tidyr by last occurrence of string

Question

I have a column I'd like to separate:

df <- tibble(
  variable = c("var_a_min", "var_ab_max", "var_abc_mean", "var_abcd_sd"),
  value = c(1,2,3,4)
)

The data look like this:

# A tibble: 4 x 2
  variable     value
  <chr>        <dbl>
1 var_a_min        1
2 var_ab_max       2
3 var_abc_mean     3
4 var_abcd_sd      4

I'd like to separate the variable column, such that what's after the last underscore becomes the second column.

df %>% separate(variable, c("variable", "metric"), sep = [after last _])

I tried out some regex, but couldn't figure it out. The data should look like this:

# A tibble: 4 x 3
  variable metric value
  <chr>    <chr>  <dbl>
1 var_a    min        1
2 var_ab   max        2
3 var_abc  mean       3
4 var_abcd sd         4

akrun · Accepted Answer

One option is extract, which captures the characters as groups. The first capture group, (.*), is a greedy match of zero or more characters. It is followed by a literal _ and a second group, ([^_]+)$, which matches one or more characters that are not a _ up to the end of the string ($). Because the second group cannot contain an underscore, the greedy first match is forced to backtrack to the last _.

library(tidyverse)
df %>% 
    extract(variable, into = c("variable", "metric"), "(.*)_([^_]+$)")
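To see what each capture group ends up holding, here is a minimal sketch in base R. It uses regexec and regmatches only to inspect the groups; the vector x and the matrix pieces are names made up for this illustration, not part of the answer.

```r
# Same strings as the question's `variable` column
x <- c("var_a_min", "var_ab_max", "var_abc_mean", "var_abcd_sd")

# (.*) is greedy: it first takes the whole string, then backtracks until
# "_" followed by an underscore-free tail ([^_]+)$ can match at the end.
pattern <- "(.*)_([^_]+)$"

# For each string: full match, capture group 1, capture group 2
parts <- regmatches(x, regexec(pattern, x))

# Stack the per-string results into a character matrix for easy reading
pieces <- do.call(rbind, parts)
colnames(pieces) <- c("match", "variable", "metric")
pieces
```

The second and third columns of pieces correspond to the variable and metric columns that extract would create.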

separate can take regex lookarounds as well, so if the prefix substring is 'var', you can split with a lookaround:

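# negative lookbehind: split on the "_" that is not preceded by "var"
df %>% 
  separate(variable, into = c("variable", "metric"), "(?<!var)_")
# A tibble: 4 x 3
#  variable metric value
#  <chr>    <chr>  <dbl>
#1 var_a    min        1
#2 var_ab   max        2
#3 var_abc  mean       3
#4 var_abcd sd         4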
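If the prefix is not known in advance, a lookahead anchored to the end of the string might be an alternative. The sketch below is an assumption on my part rather than something from the answer above: the pattern "_(?=[^_]+$)" matches an underscore only when everything after it, up to the end of the string, is free of underscores. The name out is made up for this example.

```r
library(tidyr)

# Same data as in the question
df <- tibble::tibble(
  variable = c("var_a_min", "var_ab_max", "var_abc_mean", "var_abcd_sd"),
  value = c(1, 2, 3, 4)
)

# Split only on the last "_": the lookahead (?=[^_]+$) requires that the
# rest of the string contains no further underscore.
out <- separate(df, variable, into = c("variable", "metric"),
                sep = "_(?=[^_]+$)")
out
```

Since the pattern never mentions 'var', it should behave the same whatever the strings start with, as long as each one contains at least one underscore.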
