Reputation: 505
I am new to regular expression and have read http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf regex documents. I know similar questions have been posted previously, but I still had a difficult time trying to figuring out my case.
I have a vector of string filenames, try to extract substring, and save as new filenames. The filenames follow the the pattern below:
\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)
For example, ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM, ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substring will be MS-15-0452-268 and SP56-01_A
I used
map(strsplit(filenames, '_'),3)
but failed, because the new filenames could have _, too.
I turned to regular expression for advanced matching, and come up with this
gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)
still did not get what I needed.
Upvotes: 4
Views: 5030
Reputation: 3488
Call me a hack. But if that is guaranteed to be the format of all my strings, then I would just use strsplit
to hack the name apart, then only keep what I wanted:
string <- 'ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM'
string_bits <- strsplit(string, '_')[[1]]
file_name<- string_bits[3]
file_name
[1] "MS-15-0452-268"
And if you had a list of many file names, you could remove the explicit [[1]]
use sapply()
to get the third element of every one:
sapply(string_bits, "[[", 3)
Upvotes: 0
Reputation: 43169
You may use
filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)
Which yields
[1] "MS-15-0452-268" "SP56-01_A"
^ # start of the string
(?:[^_]+_){2} # not _, twice
(.+?) # anything lazily afterwards
_\\d+ # until there's _\d+
.* # consume the rest of the string
This pattern is replaced by the first captured group and hence the filename in question.
Upvotes: 2