Reputation: 711
I have a table with a string column formatted like this:
abcdWorkstart.csv
abcdWorkcomplete.csv
I would like to extract the last word in each filename. I think the starting pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working:
grepl("Work{*}.csv", data$filename)
Basically, I want to extract whatever is between Work and .csv.
Desired outcome:
start
complete
Upvotes: 10
Views: 4381
Reputation: 11480
Just as an alternative, remove everything you don't want:
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note: I have to use gsub rather than sub, because two parts are removed: first ^.*Work, then \\.csv$.
Patterns such as [\\s\\S] or [\\d\\D] do not work with [g]?sub under the default regex engine:
https://regex101.com/r/wFgkgG/1
They do work with akrun's approach, which uses perl = TRUE (v1 as defined in akrun's answer):
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
str1 <- '12
.2
12'
gsub("[^.]", "m", str1, perl = TRUE)
gsub(".", "m", str1, perl = TRUE)
gsub(".", "m", str1, perl = FALSE)
With R's default engine (perl = FALSE), . also matches \n.
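To make the difference between the two engines concrete, here is a small sketch with the result I would expect from each call. The (?s) inline flag in the last call is not part of the answer above; it is an addition showing the PCRE way to let . match newlines as well.

```r
# Same content as str1 above: three lines separated by "\n"
x <- "12\n.2\n12"

# [^.] is a negated class: everything except a literal dot, newlines included
not_dot <- gsub("[^.]", "m", x, perl = TRUE)        # "mmm.mmmm"

# PCRE (perl = TRUE): . does not match "\n" by default
pcre_dot <- gsub(".", "m", x, perl = TRUE)          # "mm\nmm\nmm"

# Default engine (perl = FALSE): . matches "\n" too
tre_dot <- gsub(".", "m", x)                        # "mmmmmmmm"

# PCRE with the (?s) flag: . matches "\n" as well
pcre_dotall <- gsub("(?s).", "m", x, perl = TRUE)   # "mmmmmmmm"
```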
Upvotes: 6
Reputation: 886948
Here is an option using regmatches/regexpr from base R. A regex lookaround matches all characters that are not a . after the string 'Work' and before '.csv', and regmatches extracts them:
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
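Note that regmatches silently drops elements without a match, so the result can be shorter than the input. If the result has to line up with the original column, one possible sketch is to fill a vector of NA values using the match positions. The name res is only illustrative.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() returns -1 for elements without a match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with all NA, then fill in only the positions that matched
res <- rep(NA_character_, length(v1))
res[m != -1] <- regmatches(v1, m)
res
# [1] "start"    "complete" NA
```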
Upvotes: 5
Reputation: 18661
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
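As a variation on the same idea, a capture group can replace the lookarounds. This is a sketch, assuming stringr is installed; it uses str_match, which returns a matrix with the full match in the first column and each capture group in the following columns. Elements without a match come back as NA.

```r
library(stringr)

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# Column 1 is the full match, column 2 the first capture group
m <- str_match(x, "Work(.+)\\.csv$")
last_word <- m[, 2]
last_word
# [1] "start"    "complete" NA
```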
Upvotes: 7
Reputation: 160407
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when the pattern is not found, sub returns the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
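The two steps above (substitute, then flag the non-matches) can be wrapped in a small helper. This is only a sketch: extract_between and its arguments are hypothetical names, and start and end are treated as regex fragments, so special characters in them need escaping. It uses grepl to decide which elements matched and applies sub only to those.

```r
# Hypothetical helper: returns the text between `start` and `end`,
# or NA where the pattern is not found
extract_between <- function(x, start = "Work", end = "\\.csv") {
  pattern <- paste0("^.*", start, "(.*)", end, "$")
  hit <- grepl(pattern, x)                  # TRUE where the pattern matches
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit])   # keep only the captured part
  out
}

fn <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")
extract_between(fn)
# [1] "start"    "complete" NA
```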
Upvotes: 10