Reputation: 2652
I have a set of strings such as
mystring
[1] "RData/processed_AutoServico_cat.rds"
[2] "RData/processed_AutoServico_cat_master.rds"
I would like to retrieve the string between the last occurrence of an underscore "_" and ".rds".
I can do it in two steps
str_extract(mystring, '[^_]+$') %>% # get everything after the last '_'
str_extract('.+(?=\\.rds)') # get everything that precedes '.rds'
[1] "cat" "master"
And there are other ways I can do it.
Is there a single regex that would get me all the characters between the last occurrence of a generic character and another fixed expression?
Regex such as
str_extract(mystring, '[^_]+$(?=\\.rds)')
str_extract(mystring, '(?<=[_]).+$(?=\\.rds)')
do not work
Upvotes: 4
Views: 2479
Reputation: 887981
With base R, we get the basename and use sub to capture the word before the . followed by the characters that are not a . till the end ($) of the string, and replace the match with the backreference (\\1) of the captured group
sub(".*_(\\w+)\\.[^.]+$", "\\1", basename(mystring))
#[1] "cat" "master"
If the extension is fixed (always .rds)
sub(".*_(\\w+)\\.rds", "\\1", basename(mystring))
Or using gsub
gsub(".*_|\\.[^.]+$", "", mystring)
#[1] "cat" "master"
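As a rough sketch of the steps above, the snippet below runs the basename and sub calls separately on the two example paths from the question. It also shows one caveat that follows from how sub works: a string that does not match the pattern is returned unchanged rather than as NA. The input "RData/nounderscore.rds" is a made-up example, not part of the original question.

```r
# Example paths taken from the question
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# Step 1: drop the directory part
files <- basename(mystring)

# Step 2: the greedy ".*_" runs up to the last underscore that still lets
# "(\\w+)\\.[^.]+$" match, so group 1 is the piece before the extension
result <- sub(".*_(\\w+)\\.[^.]+$", "\\1", files)
print(result)

# Caveat: sub() returns the input unchanged when nothing matches
# ("RData/nounderscore.rds" is a hypothetical input for illustration)
unchanged <- sub(".*_(\\w+)\\.[^.]+$", "\\1", basename("RData/nounderscore.rds"))
print(unchanged)
```

If unmatched inputs should be dropped or flagged instead, a match-based approach such as regmatches with regexpr may be a better fit.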
Upvotes: 1
Reputation: 627600
The [^_]+$(?=\.rds) pattern matches 1+ chars other than _ up to the end of the string, and then it requires .rds after the end of the string. That is impossible, so this regex will never match any string. (?<=[_]).+$(?=\.rds) is similar in that regard and won't match any string either: it starts matching once it finds the first _, runs to the end of the string, and then tries to find .rds after it.
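A quick way to confirm this is to test both patterns from the question against the example strings. The sketch below uses base R grepl with perl = TRUE (needed for lookarounds) instead of stringr, which is a substitution made only to keep the example dependency-free.

```r
# Example paths taken from the question
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

# Both patterns place "$" before the lookahead, so the lookahead is
# evaluated at the end of the string, where ".rds" can no longer follow
bad1 <- "[^_]+$(?=\\.rds)"
bad2 <- "(?<=[_]).+$(?=\\.rds)"

hits1 <- grepl(bad1, mystring, perl = TRUE)
hits2 <- grepl(bad2, mystring, perl = TRUE)
print(hits1)  # no element matches
print(hits2)  # no element matches
```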
You may use
str_extract(mystring, "[^_]+(?=\\.rds$)")
Or, base R equivalent:
regmatches(mystring, regexpr("[^_]+(?=\\.rds$)", mystring, perl=TRUE))
Pattern details
[^_]+ - 1 or more chars other than _
(?=\.rds$) - a positive lookahead that requires .rds at the end of the string immediately to the right of the current location.
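To address the "generic character" part of the question, the same pattern can be assembled from its two pieces. The following is a minimal sketch, not an established API: extract_between_last is a hypothetical helper name, and it assumes delim is a single character that is safe inside a character class (not ^, ], or \) and that suffix_regex is already regex-escaped by the caller.

```r
# Hypothetical helper (not part of base R or stringr).
# delim:        single literal character, assumed safe inside [^...]
# suffix_regex: already-escaped regex for the fixed trailing expression
extract_between_last <- function(x, delim, suffix_regex) {
  pattern <- paste0("[^", delim, "]+(?=", suffix_regex, "$)")
  m <- regexpr(pattern, x, perl = TRUE)  # perl = TRUE enables the lookahead
  out <- rep(NA_character_, length(x))   # NA where nothing matches
  hit <- m != -1
  out[hit] <- regmatches(x, m)           # regmatches returns matches only
  out
}

# Example paths taken from the question
mystring <- c("RData/processed_AutoServico_cat.rds",
              "RData/processed_AutoServico_cat_master.rds")

res <- extract_between_last(mystring, "_", "\\.rds")
print(res)
```

Unlike a bare regmatches call, this sketch keeps the output the same length as the input by filling NA for strings that do not match, which mirrors what str_extract does.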
Upvotes: 4