Reputation: 21

Simple trouble with regular expression

I have this string:

I have an eraser and 2 pencils.
Jane has a ruler and a stapler.

I need to get all the items that I have (lines starting with I have). I have tried these expressions:

(?:I have|and)\h+((?:a|an|\d+)\h+(?:\w+))
# returns some of the items that Jane has.

(I have )(?(1)((?:a|an|\d+) \w+))
# returns only the word closest to the beginning of the string.

I'm looking for a way to match a given string/expression at the beginning of the line or somewhere before the capturing group. Thanks in advance.

Note: I'm working with PCRE

Upvotes: 2

Answers (2)

Mariano

Reputation: 6521

This is a typical case for anchoring at the end of previous match with \G.

We're trying to match some text followed by an unknown number of tokens, and we need to capture each token individually. The regex engine is perfectly capable of repeating a construct to match a repeating token, but each capturing group stores only one value. Repeating a capturing group therefore overwrites its stored value on every iteration, and only the last matched value is returned. The task can be solved with 2 different strategies: either capture all tokens with 1 pattern and then use a second pattern to split them, or perform one full match for each token.
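To see the overwriting concretely, here is a minimal Python sketch. Python's re module is only a stand-in for PCRE here, and the sample line, the simplified one-word item patterns and the variable names are assumptions made for illustration. It also shows the first strategy: capture the whole list once, then split it with a second pattern.

```python
import re

line = "I have an eraser and a pencil and an item"

# Repeating a capturing group: every iteration overwrites group 1,
# so only the last item survives.
repeated = re.match(r"I have(?:(?: and)? an? (\w+))+", line)
last_only = repeated.group(1)

# Strategy 1: capture the whole list with one pattern,
# then use a second pattern to split it into items.
whole = re.match(r"I have((?:(?: and)? an? \w+)+)", line)
items = re.findall(r"\ban? (\w+)", whole.group(1))

print(last_only)  # item
print(items)      # ['eraser', 'pencil', 'item']
```

The rest of this answer follows the second strategy instead, one match per item.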

Instead of trying to get all the items "I have" in the same match, we're going to attempt to match once per item. This approach was also tried with some of the patterns proposed in the comments. However, as you may have realized, the regex engine also attempts matches starting in the middle of the string, and thus matches unwanted cases like:

She has >>a turtle<< ...
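The problem can be reproduced with the first pattern from the question. The sketch below is only an approximation in Python: \h is written as [ \t] because Python's re module has no \h, and the variable names are made up for illustration.

```python
import re

text = "I have an eraser and 2 pencils.\nJane has a ruler and a stapler."

# Same idea as the first pattern in the question, with \h written as [ \t].
# Nothing ties "and" to a preceding "I have", so it matches on any line.
unanchored = re.compile(r"(?:I have|and)[ \t]+((?:a|an|\d+)[ \t]+\w+)")

found = unanchored.findall(text)
print(found)  # ['an eraser', '2 pencils', 'a stapler']
```

The last element comes from Jane's line, which is exactly the unwanted kind of match.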

This is where we can use an anchor like \G. Our strategy will be:

1. Match ^I have and capture 1 item (the match ends here).
2. In the next match, start at the end of the previous match, and match 1 item.
3. Repeat (2) for successive matches.
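The three steps can be sketched in Python before writing the real pattern. This is an emulation under stated assumptions, not PCRE itself: Python's re module has no \G, so the sketch calls pattern.match(line, pos), which only succeeds exactly at pos. The patterns are deliberately simplified to one-word items preceded by an article, and FIRST, NEXT and items_i_have are hypothetical names.

```python
import re

# Simplified on purpose: one-word items, each preceded by an article.
FIRST = re.compile(r"I have an? (\w+)")      # step 1
NEXT = re.compile(r"(?: and)? an? (\w+)")    # steps 2 and 3

def items_i_have(line):
    """Return one item per match, each match starting where the last one ended."""
    items = []
    m = FIRST.match(line)                # match() anchors at position 0, like ^
    while m:
        items.append(m.group(1))
        m = NEXT.match(line, m.end())    # anchored at the previous end, like \G
    return items

print(items_i_have("I have an eraser and a pencil"))  # ['eraser', 'pencil']
print(items_i_have("She has a turtle and a car"))     # []
```

Because the continuation can only start where the previous match ended, a line that never matched step 1 yields nothing.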

Now, this can be translated to regex:

1. ^I have an? + the token
- Literal text at the beginning of the line.
- an or a.
- We'll cover the token construct later.

2. \G(?!^)(?: and)? an? + the token
- \G matches a zero-width position at the end of the previous match. This is what keeps the regex engine from attempting a match just anywhere in the string.
- However, \G also matches at the beginning of the string, and we don't want to match the string "an item...". There's a trick: we use the negative lookahead (?!^) to assert "this position is not the start of the text". Therefore, it's guaranteed to match only where the previous match in (1) left off.
- (?: and)? is optional, so it may or may not be there.
- an? matches the article (an/a).

Do you see that both end up with the same construct? If we join the 2 options together:

(?:^I have:?|(?!^)\G(?: and)?) an? <<the token>>

Let's talk about the token. If it were only one word, we'd use \w+. That's not the case. Neither is .* because it shouldn't match the whole string. And we can't consume part of the following token because otherwise it wouldn't be returned in the next match.

I have a new eraser and a pencil
                   ^
                   |
        How does it stop here?!

How do we define a condition not to allow a match beyond that position?

It's not followed by a/an/and !!!

This can be achieved with a negative lookahead that guarantees the current position is not followed by a/an/and before we match each character: (?! a | an | and ). where the trailing dot is the character being matched. As you can imagine, that construct is repeated to match every one of the characters in a token.

This pattern matches what we want: (?:(?! and | an? ).)+

And one more thing, we'll use a capturing group around it to be able to extract the text.

the token = ((?:(?! and | an? ).)+)
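As a quick check of the token construct on its own, the following Python sketch compares it with the two alternatives that were ruled out. The sample text is the tail of the example line above, and TOKEN is just an illustrative name.

```python
import re

# The token: any run of characters that never steps onto " and ", " a " or " an ".
TOKEN = re.compile(r"((?:(?! and | an? ).)+)")

text = "new eraser and a pencil"

token = TOKEN.match(text).group(1)
one_word = re.match(r"\w+", text).group()
greedy = re.match(r".*", text).group()

print(token)     # new eraser
print(one_word)  # new
print(greedy)    # new eraser and a pencil
```

The lookahead is checked before every single character, which is why the match stops right before " and " without consuming any of it.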

First version:

We now have the first working version of the regex. Put together:

(?:^I have:?|(?!^)\G(?: and)?) an? ((?:(?! and | an? ).)+)

Test it in regex101

A few more tricks:

Following the same principle, this approach allows us to include more conditions to the match. For instance,

- Not anchored to the start of line.
- Without capturing groups, returning each token as the value of the full match.
- Items can be separated with commas.
- "I have" could be followed by any word, not necessarily an article, using exceptions.
- etc.

What to choose depends on the subject text, and it should be tested with several examples and corrected until it works as desired.

Solution:

This is the pattern I'd personally use in this case:

(?:                                         # SUBPATTERN 1
    \bI have:?                              #  "I have"
    (?![ ](?:to|been|\w+?[en]d)\b)          #  not followed by to|been|\w+[en]d
  |                                         #   or
    (?!\A)\G[ ]?                            #  anchored to previous match
    ,?(?:[ ]?and)?                          #  optional comma or "and"
)                                           #
                                            #
[ ](?:(?:an?|some)[ ])?                     # ARTICLE: a|an|some
                                            #
\K                                          # \K (reset match) 
                                            #
(?:                                         # SUBPATTERN 2
    (?!                                     #  Negative lookahead (exceptions)
        [ ]*,                               #   a. Comma to list another item
      |                                     #   b. Article (a|an), some
        [ ](?:a(?:nd?)?|some)\b             #      or and
    )                                       #
    .                                       #  MATCH each character in a token
)+                                          # REPEAT Subpattern 2

One-liner:

(?:\bI have:?(?! (?:to|been|\w+?[en]d)\b)|(?!\A)\G ?,?(?: ?and)?) (?:(?:an?|some) )?\K(?:(?! *,| (?:a(?:nd?)?|some)\b).)+

Test in regex101

However, it should be tested to identify exceptions and use cases. This is how it behaves with the examples discussed in this post.

Matching the subject text:

Each match has been marked with >> and <<.

I have an >>eraser<<, a >>pencil<< and an >>item<<
She has a turtle and a car
I have an >>awesome motorcycle tatoo<< and a >>bag<<
I have to say I have a >>train<< and a >>bicycle<<
I have >>3 bricks<< and >>4 knees<< and a >>tie<<

Notice these are full matches, and not the value returned by a group. Simply add a group to enclose the "subpattern 2" to capture the tokens.
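For readers without a PCRE engine at hand, the one-liner can be approximated in Python. This is a sketch under stated assumptions rather than the pattern itself: Python's re module supports neither \G nor \K, so \G is emulated with pattern.match(text, pos) and \K is replaced by a capturing group around subpattern 2. The names HEAD, CONT, TAIL and items_i_have are hypothetical.

```python
import re

# Pieces of the one-liner, split so the two alternatives can be tried separately.
HEAD = r"\bI have:?(?! (?:to|been|\w+?[en]d)\b)"   # "I have" plus its exceptions
CONT = r" ?,?(?: ?and)?"                           # what may follow a previous match
TAIL = r" (?:(?:an?|some) )?((?:(?! *,| (?:a(?:nd?)?|some)\b).)+)"  # article + token

HEAD_RE = re.compile(HEAD + TAIL)
CONT_RE = re.compile(CONT + TAIL)

def items_i_have(text):
    """Collect every item, one match at a time."""
    items = []
    pos = 0
    after_match = False
    while True:
        # Emulated \G: the continuation may only start where the last match ended.
        m = CONT_RE.match(text, pos) if after_match else None
        if m is None:
            m = HEAD_RE.search(text, pos)  # otherwise look for the next "I have"
        if m is None:
            return items
        items.append(m.group(1))
        pos = m.end()
        after_match = True

sample = "\n".join([
    "I have an eraser, a pencil and an item",
    "She has a turtle and a car",
    "I have an awesome motorcycle tatoo and a bag",
    "I have to say I have a train and a bicycle",
    "I have 3 bricks and 4 knees and a tie",
])
print(items_i_have(sample))
```

Here the tokens come back as group 1 instead of as the full match, which is the group-based variant mentioned above.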

Test in regex101

Upvotes: 1

Jivan

Reputation: 23098

It's still tricky to have a variable number of groups, but you can try this:

I have (?:an |a )?(\d? ?\w+)(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?

Below are some sample results:

"I have an eraser and a pencil and an item"  -> ["eraser", "pencil", "item"]
"She has a turtle and a car"                 -> []
"I have 3 bricks and 4 knees and a tie"      -> ["3 bricks", "4 knees", "tie"]
"I have a motorcycle and a bag"              -> ["motorcycle", "bag"]
"I have a journal"                           -> ["journal"]
"I have wires and tires"                     -> ["wires", "tires"]
"I must say I have a train and a bicycle"    -> ["train", "bicycle"]

For each line, it will capture a maximum of 3 items.
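A small Python sketch of this fixed-number-of-groups idea follows. It builds the pattern from one item sub-pattern plus two optional " and item" repetitions, which is an assumption about the intended shape; ITEM, PATTERN and items are illustrative names. Groups that did not take part in the match come back as None and are filtered out.

```python
import re

ITEM = r"(?:an |a )?(\d? ?\w+)"
# One mandatory item followed by two optional " and <item>" parts: 3 groups in total.
PATTERN = re.compile(r"I have " + ITEM + (r"(?: and " + ITEM + r")?") * 2)

def items(line):
    """Return the captured items, skipping groups that did not participate."""
    m = PATTERN.search(line)
    if m is None:
        return []
    return [g for g in m.groups() if g is not None]

print(items("I have 3 bricks and 4 knees and a tie"))  # ['3 bricks', '4 knees', 'tie']
print(items("I have a journal"))                       # ['journal']
print(items("She has a turtle and a car"))             # []
```

A line with more items than groups is silently truncated, which is the limitation of this approach.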