Extract substring using regular expression in R

Question

I am new to regular expressions and have read the regex document at http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf. I know similar questions have been posted before, but I still had a difficult time figuring out my case.

I have a vector of filename strings and am trying to extract a substring from each one and save it as the new filename. The filenames follow the pattern below:

\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)

For example, for ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM and ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substrings to extract are MS-15-0452-268 and SP56-01_A.

I used

map(strsplit(filenames, '_'),3)

but it failed, because the substring to extract can itself contain _.

I turned to regular expressions for more advanced matching and came up with this:

gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)

but I still did not get what I needed.

Jan · Accepted Answer

You may use

filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
# Backslashes must be doubled inside R string literals
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)

Which yields

[1] "MS-15-0452-268" "SP56-01_A"

The pattern here is

^             # start of the string
(?:[^_]+_){2} # one or more non-_ characters followed by _, twice
(.+?)         # capture anything, lazily (as few characters as possible)
_\d+          # until the first _ followed by digits
.*            # consume the rest of the string
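
The breakdown above can also be kept inside the pattern itself. The following is a minimal sketch, assuming PCRE is acceptable (perl = TRUE): the (?x) flag enables free-spacing mode, in which whitespace and # comments inside the pattern are ignored. The variable names pattern and extracted are only illustrative.

```r
# The two example filenames from the question.
filenames <- c(
  "ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM",
  "ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM"
)

# (?x) is a PCRE feature, so perl = TRUE is required below.
# Backslashes are doubled because this is an R string literal.
pattern <- "(?x)
  ^               # start of the string
  (?:[^_]+_){2}   # two leading fields, each followed by _
  (.+?)           # capture as few characters as possible ...
  _\\d+           # ... up to the first _ followed by digits
  .*              # consume the rest of the string
"

# sub() is enough here: the pattern is anchored and matches once per string.
extracted <- sub(pattern, "\\1", filenames, perl = TRUE)
print(extracted)
```

For the two example filenames this gives the same result as the one-line pattern.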

Because the pattern matches the entire string, gsub replaces the whole filename with the first capture group, leaving only the wanted substring.
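
One caveat: the lazy group stops at the first _ followed by digits, so a name that itself contains such a piece would be cut short. The sketch below is a stricter variant, not part of the original answer, that anchors on the fixed tail described in the question (eight numeric fields followed by AM or PM). The third filename is a hypothetical example added only to show the difference, and the names lazy and strict are illustrative.

```r
filenames <- c(
  "ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM",
  "ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM",
  "ABC_XY_SP56_01_206_281_12_1_2017_1_52_34_AM"  # hypothetical: name contains _01
)

# Pattern from the answer: lazy capture up to the first _<digits>.
lazy <- "^(?:[^_]+_){2}(.+?)_\\d+.*"

# Stricter variant: the capture must be followed by exactly eight
# _<digits> fields and then _AM or _PM at the end of the string.
strict <- "^(?:[^_]+_){2}(.+)(?:_\\d+){8}_(?:AM|PM)$"

lazy_result   <- sub(lazy,   "\\1", filenames, perl = TRUE)
strict_result <- sub(strict, "\\1", filenames, perl = TRUE)

print(lazy_result)    # third element is truncated to "SP56"
print(strict_result)  # third element keeps the full "SP56_01"
```

Note that sub() returns a string unchanged when the pattern does not match, so filenames that do not follow the expected layout pass through as-is with the stricter pattern.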
