johnbayesa

Reputation: 41

Regex that captures a group with a positive lookahead but doesn't match a pattern

Using regex (Python) I want to capture a group \d-.+? that is immediately followed by another pattern \sLEFT|\sRIGHT|\sUP.

Here is my test set (from http://nflsavant.com/about.php):

(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).
(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).
(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).
(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).

My best attempt is (\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP), which works unless other text appears between an earlier digits-and-hyphen token and my positive lookahead. In the second line of my test set, this expression captures "69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK" instead of the desired "33-D.COOK".
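
A minimal reproduction of the problem, assuming Python's built-in re module and the second line of the test set copied verbatim (the variable names are only for illustration):

```python
import re

# The pattern from the question: ".+?" is lazy, but it can still match
# spaces and digits, so it keeps expanding until the lookahead succeeds.
pattern = re.compile(r"(\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP)")

line = ("(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK LEFT GUARD "
        "TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).")

match = pattern.search(line)
captured = match.group(1)
print(repr(captured))
```

The match starts at the first digits-and-hyphen token (69-) and only stops once " LEFT" follows, so everything in between ends up in the group.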

My inputs are also saved on regex101, here: https://regex101.com/r/tEyuiJ/1

How can I modify (or completely rewrite) my regex to only capture the group immediately followed by my exact positive lookahead with no extra characters between?

Upvotes: 2

Views: 91

Answers (4)

bobble bubble

Reputation: 18515

To keep the match from running across digits, use \D, which matches any non-digit (the uppercase form is the negation of \d).

\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)

See this demo at regex101


I also added a word boundary and changed the lookahead to a non-capturing group.
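
A short sketch of how this pattern could be used from Python, assuming the four lines from the question as input and re.search applied line by line (the names LINES and players are made up for the example):

```python
import re

# \D+? cannot step over the digits of a later "33-", so a failed attempt
# at "69-" is abandoned instead of swallowing the next player.
pattern = re.compile(r"\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)")

LINES = [
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
]

players = []
for line in LINES:
    match = pattern.search(line)
    if match:
        # Group 1 holds the player; the direction word is matched but not captured.
        players.append(match.group(1))

print(players)
```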

Upvotes: 3

SaSkY

Reputation: 1086

Try this:

\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)

\b\d+-[^\r \n]+

  • \b word boundary to ignore things like foo30-J.RICHARD

  • \d+ match one or more digit.

  • - match a literal -.

  • [^\r \n]+ match one or more characters except \r, \n, and a literal space. Excluding \r and \n keeps the match from crossing newlines, and that is why \s is not used (it matches \r and \n too).

(?= +(?:LEFT|RIGHT|UP)\b) Using positive lookahead.

  • " +" (a space followed by +) ensures there are one or more literal spaces.
  • (?:LEFT|RIGHT|UP)\b uses a non-capturing group to ensure the preceding spaces are followed by one of the words LEFT, RIGHT, or UP. The \b word boundary ignores things like RIGHTfoo or LEFTbar.

See regex demo
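
For anyone who wants to try this from Python, here is a rough sketch using the four lines from the question. Since this pattern has no capture group, the player is the whole match (group 0). The two extra strings at the end are made-up inputs that exercise the word boundaries described above:

```python
import re

pattern = re.compile(r"\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)")

LINES = [
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
]

# The lookahead does not consume text, so group(0) is just the player.
results = [m.group(0) for m in map(pattern.search, LINES) if m]
print(results)

# Made-up inputs: the leading \b rejects "foo30-...", and the trailing \b
# rejects a direction word glued to other letters.
no_leading_boundary = pattern.search("foo30-J.RICHARD LEFT GUARD")
no_trailing_boundary = pattern.search("30-J.RICHARD RIGHTfoo")
print(no_leading_boundary, no_trailing_boundary)
```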

Upvotes: 1

The fourth bird

Reputation: 163362

If you want a capture group without any lookarounds:

\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b

Explanation

  • \b A word boundary to prevent a partial word match
  • (\d+-\S*) Capture group 1, match 1+ digits, a literal -, and zero or more non-whitespace characters
  • \s Match a single whitespace character
  • (?:LEFT|RIGHT|UP) Match any of the alternatives
  • \b A word boundary

See the capture group value on regex101.
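
As a small illustration in Python (a sketch, using the second line from the question), the difference from the lookahead versions is that the whole match now includes the direction word, so the value has to be read from group 1:

```python
import re

pattern = re.compile(r"\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b")

line = ("(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK LEFT GUARD "
        "TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).")

match = pattern.search(line)
whole = match.group(0)   # includes the consumed whitespace and direction word
player = match.group(1)  # only the captured player token
print(repr(whole), repr(player))
```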

Upvotes: 3

K.Dᴀᴠɪs

Reputation: 10139

This is why you should be careful about using . to match anything and everything unless it's absolutely necessary. From the example you provided, it appears that what you actually want to capture contains no spaces, so we could use a negated character class [^\s] or, more precisely, [\w.], in either case with a * quantifier.

The end result would look like (\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP) with the g and m flags. And of course, when . is inside a character class it is treated as a literal character, so it does not need to be escaped.

See it live at regex101.com
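
In Python there is no g flag; re.findall plays that role. A minimal sketch, assuming the four lines from the question joined with newlines (the name text is only for illustration):

```python
import re

# [\w.] cannot match a space, so the group can no longer run across words.
pattern = re.compile(r"(\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP)")

text = "\n".join([
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE.  33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
])

# With a single capture group, findall returns the group-1 strings.
found = pattern.findall(text)
print(found)
```

Note that \d* also allows zero digits before the hyphen; that is kept from the question's pattern and does not cause a false match on this test set.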

Upvotes: 2
