histelheim

Reputation: 5088

Identifying substrings based on complex rules

Assume I have text strings that look something like this:

A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3

Here I want to identify sequences of markers (A is a marker, I3 is a marker, etc.) that lead up to a subsequence consisting only of IX markers (i.e. I1, I2, or I3) that contains an I3. This subsequence can have a length of 1 (i.e. be a single I3 marker) or be arbitrarily long, but it must always contain at least one I3 marker and may contain only IX markers. The sequence that leads up to the IX subsequence may include I1 and I2, but never I3.

In the string above I need to identify:

A-B-C-I1-I2-D-E-F

which leads up to the I1-I3 subsequence which contains I3

and

D-D-D-D

which leads up to the I1-I1-I2-I1-I1-I3-I3 subsequence that contains at least 1 I3.

Here are a few additional examples:

A-B-I3-C-I3

From this string we should identify A-B, because it is followed by a subsequence of length 1 that contains I3, and also C, because it too is followed by a subsequence of length 1 that contains I3.

and:

I3-A-I3

Here A should be identified because it is followed by a subsequence of length 1 that contains I3. The first I3 itself will not be identified, because we are only interested in sequences that are followed by a subsequence of IX markers that contains I3.

How can I write a generic function/regex that accomplishes this task?

Upvotes: 3

Views: 86

Answers (3)

Avinash Raj

Reputation: 174696

Use strsplit:

> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"

> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C" 

or

> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
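The calls above could be wrapped in a small function, roughly as sketched below. The name leading_runs is hypothetical, and perl = TRUE is an assumption made here so that the pattern is explicitly interpreted by PCRE. The function also drops the empty piece that strsplit returns when the string starts with an I3 run, as in the I3-A-I3 example from the question.

```r
# Hypothetical helper (name not from the thread): split on every run of
# I-markers that contains an I3, then drop empty pieces.
leading_runs <- function(x) {
  # Same pattern as in the second strsplit call above.
  pattern <- "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*"
  pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
  # A match at the very start of the string yields "" as the first piece.
  pieces[nzchar(pieces)]
}

leading_runs("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
leading_runs("I3-A-I3")
# [1] "A"
```

One caveat with any split-based approach: a trailing piece is returned even when no I3 run follows it, so leading_runs("A-I3-B") gives "A" and "B". If the input can end that way, the last piece needs a separate check.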

Upvotes: 4

Kasravnd

Reputation: 107287

You can identify the subsequences that contain I3 with the following regex:

(?:I\\d-?)*I3(?:-?I\\d)*

You can then split your text on this regex to get the desired result.

See demo https://regex101.com/r/bJ3iA3/4
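As a rough sketch of how this might look in R (the helper name split_on_i3_runs and the use of perl = TRUE are assumptions, not part of the answer): the pattern matches the I3 runs themselves without the hyphens around them, so the pieces left after splitting keep stray leading or trailing hyphens, which are trimmed here.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
pattern <- "(?:I\\d-?)*I3(?:-?I\\d)*"

# The runs the pattern identifies (these are the separators).
i3_runs <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
i3_runs
# [1] "I1-I3"                "I1-I1-I2-I1-I1-I3-I3"

# Hypothetical helper: split on those runs, then tidy up the pieces.
split_on_i3_runs <- function(x, pattern) {
  pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
  # Pieces look like "A-B-C-I1-I2-D-E-F-" or "-D-D-D-D-"; trim the hyphens.
  pieces <- gsub("^-+|-+$", "", pieces)
  pieces[nzchar(pieces)]
}

split_on_i3_runs(x, pattern)
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
```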

Upvotes: 1

Uri Y

Reputation: 850

Try the following expression: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])*. See the match groups: https://regex101.com/r/yA6aV9/1
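To pull the match groups out in R, something like the sketch below may work. It assumes gregexpr with perl = TRUE, which records the capture positions in the capture.start and capture.length attributes; the function name captured_prefixes is hypothetical. The captured text keeps the hyphen next to the I-run, so it is trimmed afterwards.

```r
# Hypothetical helper: return the text captured by the single (.*?) group
# for every match of the expression above.
captured_prefixes <- function(x) {
  pattern <- "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*"
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  # One capture group, so each attribute is a one-column matrix.
  starts <- as.vector(attr(m, "capture.start"))
  lens <- as.vector(attr(m, "capture.length"))
  caps <- substring(x, starts, starts + lens - 1)
  # Captures look like "A-B-" or "-C-"; trim the hyphens, drop empties.
  caps <- gsub("^-+|-+$", "", caps)
  caps[nzchar(caps)]
}

captured_prefixes("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
captured_prefixes("A-I3-B")
# [1] "A"
```

Because a piece is only captured when an I3 run actually follows it, a trailing B in A-I3-B is not returned, unlike with a plain split.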

Upvotes: 0
