ajax2000

Reputation: 711

Extract substring in R using grepl

I have a table with a string column formatted like this:

abcdWorkstart.csv
abcdWorkcomplete.csv

I would like to extract the last word in each filename. I think the starting pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.

grepl("Work{*}.csv", data$filename)

Basically, I want to extract whatever is between Work and .csv.

Desired outcome:

start
complete

Upvotes: 10

Views: 4381

Answers (4)

Andre Elrico

Reputation: 11480

Just as an alternative way, remove everything you don't want.

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

gsub("^.*Work|\\.csv$", "", x)
#[1] "start"    "complete"

Please note: gsub is required here rather than sub, because the pattern removes two separate matches: first ^.*Work, then \\.csv$.
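
To see why sub is not enough here, a small sketch reusing the x vector from above: sub replaces only the first match in each string, so the .csv suffix survives. The names only_first and all_matches are hypothetical.

```r
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

# sub() stops after the first match, so only the "^.*Work" part is removed
only_first <- sub("^.*Work|\\.csv$", "", x)

# gsub() replaces every match, so both the prefix and the suffix are removed
all_matches <- gsub("^.*Work|\\.csv$", "", x)
```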


A note on [\\s\\S] or [\\d\\D] as a way to match any character (this does not work with sub/gsub):

https://regex101.com/r/wFgkgG/1

It works with akrun's approach:

regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))

str1<-
'12
.2
12'

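gsub("[^.]", "m", str1, perl = TRUE)   # negated class also matches \n
gsub(".", "m", str1, perl = TRUE)      # PCRE: . does not match \n
gsub(".", "m", str1, perl = FALSE)     # default engine: . also matches \n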

With R's default regex engine (perl = FALSE), . also matches \n; with perl = TRUE it does not.

Upvotes: 6

akrun

Reputation: 886948

Here is an option using regmatches/regexpr from base R. It uses regex lookarounds to match one or more characters that are not a . after the string 'Work' and before '.csv'; regmatches then extracts the matches.

regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start"    "complete"

data

v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
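
Note that regmatches returns only the matches, so the result above has two elements for the three inputs in v1. If the result needs to stay aligned with the input (for example, to store it in a column of the same table), one possible sketch fills the non-matching positions with NA. The names m and out are hypothetical.

```r
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

# regexpr() returns -1 for elements where the pattern does not match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with NA everywhere, then fill in only the positions that matched
out <- rep(NA_character_, length(v1))
out[m != -1] <- regmatches(v1, m)
```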

Upvotes: 5

acylam

Reputation: 18661

With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start"    "complete"
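
For comparison with the base R approaches, a sketch of how str_extract handles a filename that does not contain "Work" (assuming stringr is installed; the third filename is borrowed from the other answers, and res is a hypothetical name). It returns NA for that element, so the output keeps the same length as the input.

```r
library(stringr)

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# Non-matching elements come back as NA rather than being dropped
res <- str_extract(x, "(?<=Work).+(?=\\.csv)")
```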

Upvotes: 7

r2evans

Reputation: 160407

I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:

fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start"           "complete"        "abcdNothing.csv"

You can work around this by filtering out the unchanged ones:

out[ out != fn ]
# [1] "start"    "complete"

Or marking them invalid with NA (or something else):

out[ out == fn ] <- NA
out
# [1] "start"    "complete" NA        
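
An alternative sketch that avoids comparing out with fn: since grepl only reports whether the pattern matches, it can be used to decide which elements to extract from. The names pat and matched are hypothetical.

```r
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
pat <- ".*Work(.*)\\.csv$"

# grepl() returns a logical vector: does each element match the pattern?
matched <- grepl(pat, fn)

# Extract the capture group where the pattern matched, NA elsewhere
out <- ifelse(matched, sub(pat, "\\1", fn), NA_character_)
```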

Upvotes: 10
