Reputation: 545
I would like to extract the text between "one: " and "two: " and between "two: " and "three: " in the string s1 "one: bla 1 two: bla2 three: bla3". However "two: bla2 " is not necessarily present in the string s2. So if it is s2 "one: bla 1 three: bla3" it should also work.
I've come up with the following R code, but my attempt with the additional parentheses around the "two:..." part and the question mark doesn't work:
library(gsubfn)
s1 <- "one: bla 1 two: bla2 three: bla3"
s2 <- "one: bla 1 three: bla3"
strapplyc(s1, "one: (.*) (two: (.*))? three: (.*)")
strapplyc(s2, "one: (.*) (two: (.*))? three: (.*)")
Upvotes: 0
Views: 130
Reputation: 469
Perhaps the problem is that the .* after the one: is also consuming the two: part and the text after it. So, for example, the matching groups in your line would be
1: "bla 1 two: bla2"
2: [empty]
3: "bla3"
You could fix this by making the first asterisk non-greedy with a question mark.
Some other points: I think you should put the space inside the parentheses in the two: part; otherwise, when that part is absent, there would have to be two spaces between the one: part and the three: part.
Additionally, as a minor tidy-up, you could make the parentheses around the optional part non-capturing with ?:. You only want to capture three things, and the parentheses around the two: part are just for grouping, so it's not necessary to capture them.
So altogether you would have something like this:
strapplyc(s1, "one: (.*?)(?: two: (.*))? three: (bla3)")
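For checking the idea without gsubfn, here is a minimal sketch in base R. It is not the strapplyc call above: it assumes regexec with perl = TRUE (PCRE) so that the lazy .*? is honoured, it generalises the last group to (.*) instead of the literal bla3, and it also makes the two: capture lazy. The helper name extract_parts is made up for illustration.

```r
s1 <- "one: bla 1 two: bla2 three: bla3"
s2 <- "one: bla 1 three: bla3"

# Hypothetical helper: returns the three captured pieces for one input string.
extract_parts <- function(s) {
  # Lazy first group, optional non-capturing " two: ..." part (leading space
  # inside the group), and a generic final group (an assumption; the answer
  # above hard-codes bla3).
  pattern <- "one: (.*?)(?: two: (.*?))? three: (.*)"
  # perl = TRUE selects the PCRE engine in base R
  m <- regmatches(s, regexec(pattern, s, perl = TRUE))[[1]]
  m[-1]  # drop the full match, keep only the capture groups
}

print(extract_parts(s1))  # "bla 1" "bla2" "bla3"
print(extract_parts(s2))  # first and last elements are "bla 1" and "bla3"
```

For s2 the optional part does not take part in the match, so the middle element should come back empty rather than holding any text. If you stay with strapplyc, the result may depend on which regex engine gsubfn is using, so it is worth comparing against this.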
Upvotes: 2