tho_mi

Reputation: 718

Regular expressions, extract specific parts of pattern

I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.

My expressions look like the following two:

crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt

Only the parts replaced by an asterisk vary; everything else is constant or irrelevant, as with the last part (after "f*_"):

cr*_*_g_*_*_*_f*_

Is there a straightforward way to get only the values of the asterisk parts? E.g. in the case of "r" or "gdp" I have to include the underscores in the pattern, otherwise I match the "r" at the beginning of the expression. Including the underscores gives "_r_" or "_gdp_", but I only want "r" or "gdp".

Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?

Upvotes: 1

Views: 179

Answers (2)

lmo

Reputation: 38520

We can also use regmatches and regexec to extract these values like this:

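str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")  # the two file names from the question
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))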
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"                                 
[3] "gdp"                                "100000"                            
[5] "16"                                 "16"                                
[7] "tv"                                

[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t"                                "r"
[4] "25000"                            "20"                               "40"
[7] "lin"  

Note that the first element in each vector is the full string, so to drop it we can use lapply with "[":

lapply(regmatches(str, 
                 regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
       "[", -1)
[[1]]
[1] "b"      "gdp"    "100000" "16"     "16"     "tv"    

[[2]]
[1] "t"     "r"     "25000" "20"    "40"    "lin"
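
If a table is more convenient than a list, the same pattern can probably be fed to utils::strcapture (available in R 3.4.0 and later), which returns one row per input string. This is only a sketch: the column names in proto are hypothetical labels chosen for illustration, and the integer columns assume that those three fields are always whole numbers.

```r
# File names from the question
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")

# Same pattern as above: each ([^_]+) captures one underscore-free field
pattern <- "^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$"

# Prototype data frame: one column per capture group.
# Column names are hypothetical; types are an assumption about the data.
proto <- data.frame(kind = character(), variable = character(),
                    n = integer(), p1 = integer(), p2 = integer(),
                    f = character(), stringsAsFactors = FALSE)

# One row per input string, one column per captured field
parts <- strcapture(pattern, str, proto)
parts
```

Strings that do not match the pattern come back as a row of NA values, which makes malformed file names easy to spot.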

Upvotes: 2

Cath

Reputation: 24074

You can use sub with captures and then strsplit to get a list of the separated elements:

str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b"      "gdp"    "100000" "16"     "16"     "tv"    
#[[2]]
#[1] "t"     "r"     "25000" "20"    "40"    "lin"

Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
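
A quick way to see the difference between the two character classes is to test single characters against each one. This is a small illustrative check, not part of the original solution; the variable names are made up for the example.

```r
# Three single-character test strings: a letter, a digit, an underscore
chars <- c("a", "7", "_")

# \w is the "word character" class: letters, digits AND the underscore
word_match <- grepl("^\\w$", chars)

# [[:alnum:]] covers letters and digits only, so "_" is excluded
alnum_match <- grepl("^[[:alnum:]]$", chars)

data.frame(chars, word_match, alnum_match)
```

Because the underscore is the field separator in these file names, a class that also matches it would let a capture group run across several fields.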

Upvotes: 3
