Reputation: 41
Using regex (Python), I want to capture a group \d-.+? that is immediately followed by another pattern, \sLEFT|\sRIGHT|\sUP.
Here is my test set (from http://nflsavant.com/about.php):
(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).
(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).
(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).
(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).
My best attempt is (\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP), which works unless other characters appear between a matching capture group and my positive lookahead. In the second line of my test set, this expression captures "69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK" instead of the desired "33-D.COOK".
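To make the failure concrete, here is a minimal Python sketch that runs the attempted pattern against the second play from the test set. The variable names (play, attempt, captured) are only illustrative.

```python
import re

# Second play from the test set, where the attempt over-matches.
play = ("(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD "
        "TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).")

attempt = re.compile(r"(\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP)")

# The lazy .+? starts at "69-" and keeps growing until the lookahead
# finally succeeds, so it swallows everything up to " LEFT".
captured = attempt.search(play).group(1)
print(captured)  # 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK
```

The lazy quantifier only makes the match as short as possible from its starting position; it does not move the start forward to "33-".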
My inputs are also saved on regex101, here: https://regex101.com/r/tEyuiJ/1
How can I modify (or completely rewrite) my regex to only capture the group immediately followed by my exact positive lookahead with no extra characters between?
Upvotes: 2
Views: 91
Reputation: 18515
To prevent skipping over digits, use \D, which matches any non-digit (the uppercase form is the negation of \d).
\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)
I also added a word boundary and replaced the lookahead with a non-capturing group.
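As a quick check, this pattern can be run over the four plays from the question. This is only a sketch; the names plays, pattern, and runners are made up for the example.

```python
import re

plays = [
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
]

pattern = re.compile(r"\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)")

runners = []
for play in plays:
    m = pattern.search(play)
    # group(1) is the player token; the direction word is matched but not captured.
    runners.append(m.group(1) if m else None)

print(runners)  # ['30-J.RICHARD', '33-D.COOK', '28-R.FREEMAN', '22-L.PERINE']
```

In the second play, \D+? cannot run past the digits in "33", so the attempt starting at "69-" fails and the search moves on to "33-D.COOK".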
Upvotes: 3
Reputation: 1086
Try this:
\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)
The first part is \b\d+-[^\r \n]+
\b is a word boundary, which ignores things like foo30-J.RICHARD.
\d+ matches one or more digits.
- matches a literal hyphen.
[^\r \n]+ matches one or more characters other than \r, \n, and a literal space. Excluding \r and \n keeps the match from crossing newlines, which is why \s is not used (it matches \r and \n too).
The second part is the positive lookahead (?= +(?:LEFT|RIGHT|UP)\b)
A space followed by + ensures there are one or more literal spaces.
(?:LEFT|RIGHT|UP) is a non-capturing group that ensures the preceding spaces are followed by one of the words LEFT, RIGHT, or UP.
The final \b is a word boundary, which ignores things like RIGHTfoo or LEFTbar.
See regex demo.
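Because this version has no capture group and the direction word sits inside a lookahead, re.findall returns the player tokens directly, even on a multi-line string. A small sketch, with text, pattern, and found as illustrative names:

```python
import re

text = "\n".join([
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
])

pattern = re.compile(r"\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)")

# With no capture group, findall returns the whole match for each hit.
found = pattern.findall(text)
print(found)  # ['30-J.RICHARD', '33-D.COOK', '28-R.FREEMAN', '22-L.PERINE']

# The two word boundaries reject the look-alike inputs mentioned above.
print(pattern.findall("foo30-J.RICHARD LEFT GUARD"))  # []
print(pattern.findall("30-J.RICHARD LEFTbar"))        # []
```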
Upvotes: 1
Reputation: 163362
If you want a capture group without any lookarounds:
\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b
Explanation
\b A word boundary to prevent a partial word match
(\d+-\S*) Capture group 1: match 1+ digits, a -, and optional non-whitespace characters
\s Match a single whitespace character
(?:LEFT|RIGHT|UP) Match any of the alternatives
\b A word boundary
See the capture group value on regex101.
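Since this pattern consumes the direction word instead of looking ahead, the full match and group 1 differ. The sketch below shows both on the first play; the names play, pattern, and m are illustrative.

```python
import re

play = ("(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 "
        "FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).")

pattern = re.compile(r"\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b")

m = pattern.search(play)
full_match = m.group(0)  # includes the whitespace and the direction word
player = m.group(1)      # only the captured player token

print(full_match)  # 30-J.RICHARD LEFT
print(player)      # 30-J.RICHARD
```

In Python, read the value from group(1) (or use re.findall, which returns the group when there is exactly one), not from the full match.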
Upvotes: 3
Reputation: 10139
This is why you should be careful about using . to match anything and everything unless it's absolutely necessary. From the example you provided, it appears that what you actually want to capture contains no spaces, so we could use a negated character class, [^\s], or, more precisely, [\w.], in either case with a * quantifier.
Your end result would look like "(\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP)"gm. And of course, when . is inside a character class it is treated as a literal character, so it does not need to be escaped.
See it live at regex101.com
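For Python, the trailing gm is regex101 notation rather than part of the pattern: re.findall plays the role of the g flag, and since the pattern has no ^ or $ anchors, re.MULTILINE would not change the result. A rough sketch, with play, pattern, and found as illustrative names:

```python
import re

play = ("(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD "
        "TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).")

pattern = re.compile(r"(\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP)")

# [\w.]* cannot cross the space after "69-R.HILL", so that attempt fails
# and only the token directly before " LEFT" is returned.
found = pattern.findall(play)
print(found)  # ['33-D.COOK']

# Inside a character class, "." is a literal dot, not "any character".
dot_is_literal = (re.fullmatch(r"[\w.]", ".") is not None
                  and re.fullmatch(r"[\w.]", "-") is None)
print(dot_is_literal)  # True
```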
Upvotes: 2