marc1s

Reputation: 779

How to extract parts from a string

I have a string called PATTERN:

PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"

and I would like to parse it with a pattern-matching function such as grep or sub to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome", and an integer variable IMP equal to number.

If MODEL, OUTCOME and IMP were all integers, I could get the values with sub:

PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"

MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))

How can I match the string in PATTERN when MODEL and OUTCOME are not integers?

Possible tricky patterns are:

PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

Upvotes: 1

Views: 80

Answers (3)

Sotos

Reputation: 51592

A minimal-regex approach:

sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
#     [,1]      
#[1,] "PS2"     
#[2,] "stroke_i"
#[3,] "001"     

Upvotes: 3

Jaap

Reputation: 83215

A solution which is also able to deal with the 'tricky' patterns:

PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)

which gives:

> lst
[1] "linear-model" "stroke_i"     "001"

And if you want that in a dataframe:

df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')

which gives:

> df
         MODEL  OUTCOME IMP
1 linear-model stroke_i 001

Upvotes: 4

Wiktor Stribiżew

Reputation: 626845

You can use a pattern whose capturing groups match any characters, as few as possible, between the known delimiting substrings:

MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)

Note that the last .* is greedy, since that group should capture the whole rest of the string.

You can make this pattern more precise so that it only matches the expected characters. For example, to match only digits in the last capturing group, use ([0-9]+) rather than (.*).
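As a minimal sketch of that tightening, assuming the IMP part always consists of digits: the variable names below are illustrative, and perl = TRUE is passed only so that the lazy quantifiers follow PCRE semantics.

```r
# Example input taken from the question's "tricky" patterns
x <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

# Same pattern as above, but the last group only accepts digits
rx <- "MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)"

# regexec() reports the full match plus one entry per capturing group;
# regmatches() turns those positions into substrings
parts <- regmatches(x, regexec(rx, x, perl = TRUE))[[1]]

MODEL   <- parts[2]
OUTCOME <- parts[3]
IMP     <- as.integer(parts[4])  # converts the digit string to an integer
```

Because the delimiters -OUTCOME_ and -IMP_ are part of the pattern, a hyphen inside the model name does not split the match.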

Use it with, say, str_match from stringr:

> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"

A base R solution with the same regex combines regmatches and regexec:

> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
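
If several strings need parsing at once, base R's utils::strcapture (available since R 3.4.0) can apply the same kind of regex to a whole vector and return typed columns. A sketch, assuming the IMP part is always numeric; the input vector xs is made up from the question's examples and is not part of the original data:

```r
# Hypothetical inputs modelled on the question's examples
xs <- c("MODEL_PS2-OUTCOME_stroke_i-IMP_001",
        "MODEL_linear-model-OUTCOME_stroke_i-IMP_002")

# proto fixes the column names and types of the result
proto <- data.frame(MODEL = character(),
                    OUTCOME = character(),
                    IMP = integer(),
                    stringsAsFactors = FALSE)

# One row per input string, one column per capturing group
res_df <- utils::strcapture("MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)",
                            xs, proto, perl = TRUE)
```

Here res_df$IMP comes back as an integer column, which covers the integer conversion asked for in the question without a separate as.integer call.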

Upvotes: 1
