Reputation: 21
I have this string:
I have an eraser and 2 pencils.
Jane has a ruler and a stapler.
I need to get all the items that I have (lines starting with I have
). I have tried these expressions:
(?:I have|and)\h+((?:a|an|\d+)\h+(?:\w+))
# returns some of the items that Jane has.
(I have )(?(1)((?:a|an|\d+) \w+))
# returns only the word closest to the beginning of the string.
I'm looking for a way to match a given string/expression at the beginning of the line or somewhere before the capturing group. Thanks in advance.
Note: I'm working with PCRE
Upvotes: 2
Views: 88
Reputation: 6521
This is a typical case for anchoring at the end of previous match with \G
.
We're trying to match some text followed by an unknown number of tokens, and it needs to capture each token individually. The regex engine is totally capable of repeating a construct to match repeating token, but each backreference must be defined on its own. Therefore, repeating a capturing group ends up overwriting its stored value and returning only the last matched value. This task may be achieved by 2 different strategies: either capturing all tokens with 1 pattern and then using a second pattern match to split them, or performing one full match for each token.
Instead of trying to get all the items "I have" in the same match, we're going to attempt to match once per item. This approach was also tried with some of the patterns proposed in the comments. However, as you may have realized, the regex engine also matches from the middle of the string, and thus matching unwanted cases like:
She has >>
a turtle
<< ...
This is where we can use an anchor like \G
. Our strategy will be:
^I have
and capture 1 item (the match ends here).Now, this can be translated to regex:
^I have an?
+ the token
an
or a
.the token
construct later.\G(?!^)(?: and)? an?
+ the token
\G
matches a zero-width position at the end of previous match. This is how the regex engine won't attempt a match anywhere in the string.\G
also matches at the beggining of the string, and we don't want to match the string "an item...
". There's a trick: we used the negative lookahead (?!^)
to specify "it's not followed by the start of the text". Therefore, it's guaranteed to match where it left off from the previous match in (1).(?: and)?
is optional, so it may or may not be there.an?
matches the article (an
/a
).Do you see that both end up with the same construct? if we join the 2 options together:
(?:^I have:?|(?!^)\G(?: and)?) an? <<the token>>
Let's talk about the token. If it were only one word, we'd use \w+
. That's not the case. Neither is .*
because it shouldn't match the whole string. And we can't consume part of the following token because otherwise it wouldn't be returned in the next match.
I have a new eraser and a pencil
^
|
How does it stop here?!
How do we define a condition not to allow a match beyond that position?
It's not followed by
a
/an
/and
!!!
This can be achieved by a negative lookahead, to guarantee it's not followed by a
/an
/and
before we match a character: (?! a | an | and ).
. As you can imagine, that construct will be repeated to match every one of the characters in a token.
This pattern matches what we want: (?:(?! and | an? ).)+
And one more thing, we'll use a capturing group around it to be able to extract the text.
the token
= ((?:(?! and | an? ).)+)
We now have the first working version of the regex. Put together:
(?:^I have:?|(?!^)\G(?: and)?) an? ((?:(?! and | an? ).)+)
Following the same principle, this approach allows us to include more conditions to the match. For instance,
I have
" could be followed by any word, not necessarily an article, using exceptions.What to choose depends on the subjet text, and it should be tested with several examples and corrected until it works as desired.
This is the pattern I'd personally use in this case:
(?: # SUBPATTERN 1
\bI have:? # "I have"
(?\b) # not followed by to|been|\w+[en]d
| # or
(?!\A)\G[ ] # anchored to previous match
?,?(?:[ ]?and)? # optional comma or "and"
) #
#
[ ](?:(?:an?|some)[ ])? # ARTICLE: a|an|some
#
\K # \K (reset match)
#
(?: # SUBPATTERN 2
(?! # Negative lookahead (exceptions)
[ ]*, # a. Comma to list another item
| # b. Article (a|an), some
[ ](?:a(?:nd?)?|some)\b # or and
) #
. # MATCH each character in a token
)+ # REPEAT Subpattern 2
One-liner:
(?:\bI have:?(?! (?:to|been|\w+?[en]d)\b)|(?!\A)\G ?,?(?: ?and)?) (?:(?:an?|some) )?\K(?:(?! *,| (?:a(?:nd?)?|some)\b).)+
However, it should be tested to identify exceptions and use cases. This is how it behaves with the examples discussed in this post.
Matching the subject text:
Each match has been marked.
I have an
eraser
, apencil
and anitem
She has aturtle
and acar
I have anawesome motorcycle tatoo
and abag
I have to say I have atrain
and abicycle
I have3 bricks
and4 knees
and atie
Notice these are full matches, and not the value returned by a group. Simply add a group to enclose the "subpattern 2" to capture the tokens.
Upvotes: 1
Reputation: 23098
It's still tricky do have a variable number of groups, but you can try this:
I have (?:an |a )?(\d? ?\w+)(\(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?
Below are some sample results:
"I have an eraser and a pencil and an item" -> ["eraser", "pencil", "item"]
"She has a turtle and a car" -> []
"I have 3 bricks and 4 knees and a tie" -> ["3 bricks", "4 knees", "tie"]
"I have a motorcycle and a bag" -> ["motorcycle", "bag"]
"I have a journal" -> ["journal"]
"I have wires and tires" -> ["wires", "tires"]
"I must say I have a train and a bicycle" -> ["train", "bicycle"]
For each line, it will capture a maximum number of 3 items.
Upvotes: 1