Felipe Alvarenga

Reputation: 2652

Extract string between the last occurrence of a character and a fixed expression

I have a set of strings such as

mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"

I would like to retrieve the string between the last occurrence of an underscore "_" and ".rds".

I can do it in two steps

str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
    str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat"    "master"

And there are other ways I can do it.

Is there a single regular expression that would get me all the characters between the last occurrence of a given character and a fixed expression?

Regex such as

str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')

do not work

Upvotes: 4

Views: 2479

Answers (2)

akrun

Reputation: 887981

With base R, take the basename and use sub to capture the word characters (\\w+) that follow the last _ and precede the final ., match the extension (the . followed by characters other than . up to the end, $, of the string), and replace the whole match with the backreference (\\1) to the captured group.

sub(".*_(\\w+)\\.[^.]+$", "\\1", basename(mystring))
#[1] "cat"    "master"

If the extension is fixed (here .rds), it can be written literally:

sub(".*_(\\w+)\\.rds", "\\1", basename(mystring))

Or use gsub to remove everything up to the last _ as well as the extension:

gsub(".*_|\\.[^.]+$", "", mystring)
#[1] "cat"    "master"
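One caveat with the sub approach, sketched here under assumptions: when the pattern does not match, sub returns the input unchanged rather than NA. The second input string and the variable names (pattern, b, unguarded, guarded) are made up for illustration.

```r
# Hypothetical inputs: the second file name contains no underscore.
mystring <- c("RData/processed_AutoServico_cat.rds", "RData/nounderscore.rds")
pattern <- ".*_(\\w+)\\.[^.]+$"
b <- basename(mystring)

# sub() leaves elements that do not match untouched.
unguarded <- sub(pattern, "\\1", b)

# Guard with grepl() to get NA for elements that do not match.
guarded <- ifelse(grepl(pattern, b), sub(pattern, "\\1", b), NA_character_)

print(unguarded)
print(guarded)
```

Whether the unchanged input or NA is preferable depends on how the result is used downstream.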

Upvotes: 1

Wiktor Stribiżew

Reputation: 627600

The [^_]+$(?=\.rds) pattern matches one or more characters other than _ up to the end of the string, and then requires .rds after the end of the string. That is impossible, so this regex will never match any string. (?<=[_]).+$(?=\.rds) fails for the same reason: it starts matching right after the first _, consumes everything up to the end of the string, and then looks for .rds after it.
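The effect of the misplaced $ can be checked with a small sketch. It uses base R's grepl with perl = TRUE (PCRE) instead of stringr (ICU), on the assumption that both engines treat $ and lookarounds the same way for these inputs; the variable names are illustrative.

```r
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# `$` placed before the lookahead: nothing can follow the end of the string,
# so the lookahead for ".rds" can never succeed.
fails_1 <- grepl("[^_]+$(?=\\.rds)", mystring, perl = TRUE)
fails_2 <- grepl("(?<=[_]).+$(?=\\.rds)", mystring, perl = TRUE)

# The first pattern with the `$` dropped: the lookahead can now succeed.
works <- grepl("[^_]+(?=\\.rds)", mystring, perl = TRUE)

print(fails_1)
print(fails_2)
print(works)
```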

You may use

str_extract(mystring, "[^_]+(?=\\.rds$)")

Or, base R equivalent:

regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))

Pattern details

  • [^_]+ - 1 or more chars other than _
  • (?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
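Since the question asks about a generic character and a fixed expression, the same pattern can be wrapped in a small helper. This is only a sketch: extract_between_last is a hypothetical name, ch is assumed to be a single ordinary character that needs no escaping inside a character class, and the suffix is quoted with PCRE's \Q...\E so that its . is taken literally.

```r
# Hypothetical helper: text between the last `ch` and a fixed `suffix`
# at the end of the string. Returns NA where the pattern does not match.
extract_between_last <- function(x, ch, suffix) {
  # Assumption: `ch` is one plain character such as "_" or "-".
  pattern <- paste0("[^", ch, "]+(?=\\Q", suffix, "\\E$)")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched elements, in order.
  out[m != -1] <- regmatches(x, m)
  out
}

mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")
res <- extract_between_last(mystring, "_", ".rds")
res_txt <- extract_between_last("RData/processed_cat.txt", "_", ".rds")
print(res)
print(res_txt)
```

Unlike calling regmatches directly, which drops elements that do not match, this version keeps the result the same length as the input.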

Upvotes: 4