Reputation: 779
I have a string called PATTERN:
PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
and I would like to parse the string using a pattern-matching function, like grep, sub, ... to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome" and an integer variable IMP equal to number.
If MODEL, OUTCOME and IMP were all integers, I could get the values using function sub:
PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"
MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))
Do you have any idea of how to match the string contained in variable PATTERN?
Possible tricky patterns are:
PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
Upvotes: 1
Views: 80
Reputation: 51592
A minimal-regex approach,
sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
# [,1]
#[1,] "PS2"
#[2,] "stroke_i"
#[3,] "001"
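One caveat worth checking: splitting on '-' assumes that no field value contains a hyphen itself. The sketch below applies the same two steps to the second 'tricky' pattern from the question; the variable names tricky, parts and stripped are only illustrative.

```r
# Same split-then-strip idea, applied to a model name that contains a hyphen
tricky <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

# Splitting on "-" yields four pieces rather than three
parts <- strsplit(tricky, "-")[[1]]

# Strip everything up to and including the first underscore in each piece
stripped <- sub("(.*?_){1}", "", parts)

print(parts)
print(stripped)
```

Here the model name ends up split across two elements ("linear" and "model"), so this approach fits best when the values are known to be hyphen-free.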
Upvotes: 3
Reputation: 83215
A solution which is also able to deal with the 'tricky' patterns:
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)
which gives:
> lst
[1] "linear-model" "stroke_i" "001"
And if you want that in a dataframe:
df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')
which gives:
> df
MODEL OUTCOME IMP
1 linear-model stroke_i 001
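The question asks for IMP as an integer, whereas the data frame above holds it as text. A minimal sketch of one way to get typed columns, assuming lst has the three elements shown above (df_typed is a hypothetical name):

```r
# lst as produced by the strsplit/sub steps above
lst <- c("linear-model", "stroke_i", "001")

# Build the data frame column by column so IMP can be converted explicitly
df_typed <- data.frame(
  MODEL   = lst[1],
  OUTCOME = lst[2],
  IMP     = as.integer(lst[3]),  # "001" becomes the integer 1
  stringsAsFactors = FALSE       # keep MODEL and OUTCOME as character
)

str(df_typed)
```

Note that the conversion drops the leading zeros; keep the character version as well if the zero-padded form is needed later.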
Upvotes: 4
Reputation: 626845
You may use a pattern with capturing groups matching any chars, as few as possible between known delimiting substrings:
MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)
Note that the last .* is greedy, since it should capture all the rest of the string.
You may make this pattern more precise so that it only matches the expected characters (say, to match only digits in the last capturing group, use ([0-9]+) rather than (.*)).
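As a rough illustration of the stricter variant, here it is plugged into the sub approach from the question. The ^ and $ anchors and perl = TRUE are my additions (assumptions, not part of the pattern above), and strict is just an illustrative name.

```r
x <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

# Lazy groups for the free-text fields, digits only for IMP
strict <- "^MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)$"

MODEL   <- sub(strict, "\\1", x, perl = TRUE)
OUTCOME <- sub(strict, "\\2", x, perl = TRUE)
IMP     <- as.integer(sub(strict, "\\3", x, perl = TRUE))

# sub() returns its input unchanged when nothing matches, so test first
is_valid <- grepl(strict, x, perl = TRUE)
```

With this pattern a string ending in IMP_number no longer matches, so checking grepl before calling sub avoids silently getting the whole input back.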
Use it with, say, str_match from stringr:
> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"
A base R solution using the same regex will involve regmatches / regexec:
> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
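If several such strings need parsing, the regmatches / regexec call can be wrapped in a small helper. This is only a sketch: parse_pattern is a hypothetical name, and the anchors and perl = TRUE are assumptions added on top of the regex above.

```r
# Hypothetical helper: parse a character vector of MODEL/OUTCOME/IMP strings
parse_pattern <- function(x) {
  rx <- "^MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)$"
  m <- regmatches(x, regexec(rx, x, perl = TRUE))

  # Each successful match has 4 elements: full match plus 3 groups
  ok <- lengths(m) == 4
  if (!all(ok)) {
    stop("no match for: ", paste(x[!ok], collapse = ", "))
  }

  out <- do.call(rbind, m)  # one row per input string
  data.frame(MODEL = out[, 2], OUTCOME = out[, 3], IMP = out[, 4],
             stringsAsFactors = FALSE)
}

inputs <- c("MODEL_Name.model-OUTCOME_any.outcome-IMP_number",
            "MODEL_PS2-OUTCOME_stroke_i-IMP_001",
            "MODEL_linear-model-OUTCOME_stroke_i-IMP_001")
parsed <- parse_pattern(inputs)
print(parsed)
```

IMP is left as character here because the first example carries the placeholder "number"; apply as.integer to that column once the values are known to be numeric.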
Upvotes: 1