Jian

Reputation: 505

Extract substring using regular expression in R

I am new to regular expressions and have read the regex documentation at http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf. I know similar questions have been posted previously, but I still had a difficult time figuring out my case.

I have a vector of filename strings and am trying to extract a substring from each one and save it as the new filename. The filenames follow the pattern below:

\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)

For example, for ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM and ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substrings to extract are MS-15-0452-268 and SP56-01_A.

I used

map(strsplit(filenames, '_'),3)

but it failed, because the new filenames can contain _, too.

I turned to regular expressions for more advanced matching and came up with this:

gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)

but I still did not get what I needed.

Upvotes: 4

Views: 5030

Answers (2)

Andrew Taylor

Reputation: 3488

Call me a hack. But if that is guaranteed to be the format of all my strings, then I would just use strsplit to hack the name apart, then only keep what I wanted:

string <- 'ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM'
string_bits <- strsplit(string, '_')[[1]]
file_name <- string_bits[3]
file_name

[1] "MS-15-0452-268"

And if you had a vector of many file names (say, filenames), you could drop the explicit [[1]] and use sapply() to get the third element of every split result:

sapply(strsplit(filenames, '_'), "[[", 3)
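
As a minimal sketch of the vectorised form, assuming the two example names from the question are stored in a vector called filenames, the split has to be applied to the whole vector before the third piece is picked. Note that this only recovers the full substring when the substring itself contains no underscore, so the second name comes back truncated.

```r
# Example file names taken from the question
filenames <- c(
  "ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM",
  "ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM"
)

# Split every name on "_" and keep the third piece of each result
third_field <- sapply(strsplit(filenames, "_"), "[[", 3)
third_field
# [1] "MS-15-0452-268" "SP56-01"
```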

Upvotes: 0

Jan

Reputation: 43169

You may use

filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)

Which yields

[1] "MS-15-0452-268" "SP56-01_A"    


The pattern here is

^             # start of the string
(?:[^_]+_){2} # a run of non-_ characters followed by _, twice
(.+?)         # anything lazily afterwards
_\\d+         # until there's _\d+
.*            # consume the rest of the string

The whole match is replaced by the first captured group, which leaves only the filename in question.
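
The lazy capture stops at the first underscore followed by a digit, so a name whose wanted part itself contains such a sequence would be cut short. A possible alternative, sketched here under the assumption that every name really ends in the eight numeric fields plus AM or PM described in the question, is to anchor on that fixed tail instead. The variable names and the third file name are hypothetical and only serve as an illustration.

```r
filenames <- c(
  "ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM",
  "ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM",
  "ABC_RE_SP56_01_206_281_12_1_2017_1_52_34_AM"  # hypothetical name, not from the question
)

# Skip the two leading fields, capture greedily, then require exactly
# eight numeric fields and a closing AM/PM at the end of the string.
tail_anchored <- "^[^_]+_[^_]+_(.+)(?:_\\d+){8}_(?:AM|PM)$"

# perl = TRUE selects the PCRE engine for the non-capturing groups
extracted <- sub(tail_anchored, "\\1", filenames, perl = TRUE)
extracted
# [1] "MS-15-0452-268" "SP56-01_A"      "SP56_01"
```

Names that do not fit the assumed layout are returned unchanged by sub(), so it may be worth checking them with grepl() and the same pattern first.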

Upvotes: 2
