Extract string between the last occurrence of a character and a fixed expression

Question

I have a set of strings such as

mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"

I would like to retrieve the string between the last occurrence of an underscore "_" and ".rds".

I can do it in two steps

str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
    str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat"    "master"

And there are other ways I can do it.

Is there a single regex that would get me all the characters between the last occurrence of a generic character and another fixed expression?

Regex such as

str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')

do not work

Wiktor Stribiżew · Accepted Answer

The [^_]+$(?=\.rds) pattern matches 1+ chars other than _ up to the end of the string, and then requires .rds to appear after the end of the string. That is impossible, so this regex will never match any string. (?<=[_]).+$(?=\.rds) fails for the same reason: it starts matching right after the first _, consumes everything up to the end of the string, and then tries to find .rds after that position.
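A quick way to confirm this is to test both patterns from the question against the sample strings. This is a minimal sketch in base R (using perl = TRUE so that lookarounds are available); the variable names fails_1 and fails_2 are only illustrative.

```r
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# Both patterns place the lookahead after `$`, so they ask for ".rds"
# to follow the end of the string, which can never happen.
fails_1 <- grepl("[^_]+$(?=\\.rds)", mystring, perl = TRUE)
fails_2 <- grepl("(?<=[_]).+$(?=\\.rds)", mystring, perl = TRUE)

print(fails_1)  # FALSE FALSE
print(fails_2)  # FALSE FALSE
```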

You may use

str_extract(mystring, "[^_]+(?=\\.rds$)")

Or, the base R equivalent:

regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))


Pattern details

[^_]+ - 1 or more chars other than _
(?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
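For the more general case asked about (any delimiter character and any fixed suffix), the same pattern can be built from its two parts. The following is only a sketch: extract_between_last is a hypothetical helper, not part of stringr or base R, and it assumes that delim is a single character with no special meaning inside a bracket expression (so not ^, ], \ or -) and that suffix does not contain \E.

```r
# Hypothetical helper: text between the last `delim` and a fixed `suffix`
# at the end of each string; NA where the pattern does not match.
extract_between_last <- function(x, delim, suffix) {
  # \Q...\E makes PCRE treat the suffix literally, so "." is not a wildcard.
  pattern <- paste0("[^", delim, "]+(?=\\Q", suffix, "\\E$)")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

result <- extract_between_last(mystring, "_", ".rds")
print(result)  # "cat" "master"
```

Unlike a bare regmatches() call, which silently drops non-matching elements, this version keeps the result the same length as the input.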
