Reputation: 5088
Assume I have text strings that look something like this:
A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3
Here I want to identify sequences of markers (A is a marker, I3 is a marker, etc.) that lead up to a subsequence consisting only of IX markers (i.e. I1, I2, or I3) that contains an I3. This subsequence can have a length of 1 (i.e. be a single I3 marker) or it can be of unlimited length, but it always needs to contain at least one I3 marker, and it can only contain IX markers. The subsequence that leads up to the IX subsequence may include I1 and I2, but never I3.
In the string above I need to identify:
A-B-C-I1-I2-D-E-F
which leads up to the I1-I3 subsequence, which contains I3, and
D-D-D-D
which leads up to the I1-I1-I2-I1-I1-I3-I3 subsequence, which contains at least one I3.
Here are a few additional examples:
A-B-I3-C-I3
From this string we should identify A-B, because it is followed by a subsequence of length 1 that contains I3, and also C, because it is likewise followed by a subsequence of length 1 that contains I3.
and:
I3-A-I3
Here A should be identified, because it is followed by a subsequence of length 1 that contains I3. The first I3 itself will not be identified, because we are only interested in subsequences that are followed by a subsequence of IX markers that contains I3.
How can I write a generic function/regex that accomplishes this task?
Upvotes: 3
Views: 86
Reputation: 174696
Use strsplit
> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C"
or
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
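One case the examples above do not cover is a string that starts with I3, such as I3-A-I3 from the question. There strsplit returns an empty first element, because the delimiter matches at the very start of the string. A minimal sketch of a wrapper that drops such empty pieces follows; the function name leading_sequences is hypothetical, and perl = TRUE is an assumption made here so that the pattern is handled by the PCRE engine.

```r
# Hypothetical helper (not part of the original answer): split on every run
# of IX markers that contains at least one I3, then drop empty pieces.
leading_sequences <- function(x) {
  pattern <- "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*"
  parts <- strsplit(x, pattern, perl = TRUE)[[1]]
  # A delimiter at the very start of the string yields "" as first element.
  parts[nzchar(parts)]
}

leading_sequences("I3-A-I3")
# [1] "A"
leading_sequences("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
```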
Upvotes: 4
Reputation: 107287
You can identify the sequences that contain I3 with the following regex:
(?:I\\d-?)*I3(?:-?I\\d)*
You can then split your text on this regex to get the desired result.
See demo https://regex101.com/r/bJ3iA3/4
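Note that this pattern does not consume the hyphens surrounding the IX run, so the pieces come back with leading or trailing hyphens. A rough sketch of how the split could be done in R, with the stray hyphens trimmed afterwards; the function name split_and_trim is made up for illustration, and perl = TRUE is an assumption.

```r
# Hypothetical helper: split on the IX run containing I3, then tidy up.
split_and_trim <- function(x) {
  pattern <- "(?:I\\d-?)*I3(?:-?I\\d)*"
  parts <- strsplit(x, pattern, perl = TRUE)[[1]]
  # The pattern leaves the neighbouring hyphens in place, e.g. "-D-D-D-D-".
  parts <- gsub("^-+|-+$", "", parts)
  # Drop pieces that are empty after trimming (e.g. input starting with I3).
  parts[nzchar(parts)]
}

split_and_trim("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
```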
Upvotes: 1
Reputation: 850
Try the following expression: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])*
See the match groups:
https://regex101.com/r/yA6aV9/1
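To use the first match group from R, one possible approach is to collect every full match with gregexpr and then reduce each match to its first group with sub. This is only a sketch under assumptions: the function name extract_leading is hypothetical, perl = TRUE is needed for the lazy quantifier, and the trailing hyphen that the group captures is trimmed as an extra step.

```r
# Hypothetical helper: pull out capture group 1 of every match.
extract_leading <- function(x) {
  pattern <- "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*"
  # All full matches (leading sequence plus the IX run that ends it).
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Replace each full match by its first capture group.
  lead <- sub(pattern, "\\1", matches, perl = TRUE)
  # Group 1 keeps the separating hyphens, so strip them from both ends.
  lead <- gsub("^-+|-+$", "", lead)
  lead[nzchar(lead)]
}

extract_leading("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3")
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
```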
Upvotes: 0