Gaurav Gupte

Reputation: 141

re.findall not matching all matches

I need to parse a large text file looking for patterns like this:

import re

string = 'Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED)    -0.8568      Path Group: sclk ;djlasfhv slack (VIOLATED)       -0.88 Path Group: sclkasfhv slack (VIOLATED)                -0.121'
violation = re.findall(r'Path Group: sclk.*VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)

This returns only the last value, ['-0.121']. I am expecting ['-0.8568', '-0.88', '-0.121'].

P.S. This is only a sample string. The actual file is huge. I am looking for all negative numbers that follow the text matched by my regex.

What mistake am I making?

Upvotes: 1

Views: 602

Answers (2)

Wiktor Stribiżew

Reputation: 626729

A more efficient regex than the lazy-matching version is

violation = re.findall(r'Path Group: sclk[^V]*(?:V(?!IOLATED\))[^V]*)*VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)


It is based on Jeffrey E. F. Friedl's "unrolling the loop" method. It matches essentially the same text as the lazy-matching version, but is more efficient:

  • Path Group: sclk - literal sequence of characters
  • [^V]*(?:V(?!IOLATED\))[^V]*)*VIOLATED\) - anything up to and including the first VIOLATED)
  • \s* - 0 or more whitespace characters
  • (-[0-9]\.[0-9]+) - captures into Group 1 a hyphen, followed by a single digit, then a period, and then 1 or more digits.
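
As a rough illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the name UNROLLED is made up for the example, and the spaces in the literal prefix are written as [ ] because verbose mode ignores bare whitespace in the pattern.

```python
import re

# Sample input copied from the question.
string = 'Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED)    -0.8568      Path Group: sclk ;djlasfhv slack (VIOLATED)       -0.88 Path Group: sclkasfhv slack (VIOLATED)                -0.121'

# UNROLLED is a hypothetical name used only for this illustration.
UNROLLED = re.compile(r'''
    Path[ ]Group:[ ]sclk      # literal prefix; [ ] keeps the spaces under re.VERBOSE
    [^V]*                     # run of characters other than V
    (?:
        V(?!IOLATED\))        # a V that does not start the closing keyword
        [^V]*                 # followed by another run of non-V characters
    )*
    VIOLATED\)                # the first closing keyword after the prefix
    \s*                       # 0 or more whitespace characters
    (-[0-9]\.[0-9]+)          # group 1: the negative number
''', re.VERBOSE)

violation = UNROLLED.findall(string)
print(violation)
```

One difference worth keeping in mind: [^V] also matches newline characters, whereas . does not unless re.DOTALL is set, so on multi-line input the two versions can behave differently.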

Upvotes: 2

Avinash Raj

Reputation: 174696

Make it non-greedy. The .* in your regex is greedy: it matches as many characters as possible. The single match therefore runs from the first Path Group: sclk to the last VIOLATED), and the -[0-9]\.[0-9]+ group captures only the last number.

violation = re.findall(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)
                                           ^

Example:

>>> import re
>>> string='Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED)    -0.8568      Path Group: sclk ;djlasfhv slack (VIOLATED)       -0.88 Path Group: sclkasfhv slack (VIOLATED)                -0.121'
>>> re.findall(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)
['-0.8568', '-0.88', '-0.121']
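
To see what the greedy version actually consumed, it can help to look at the full text of each match rather than only the captured group. The following is a small sketch using re.finditer on the sample string from the question; the variable names greedy, lazy, greedy_matches and lazy_matches are chosen just for this example.

```python
import re

# Sample input copied from the question.
string = 'Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED)    -0.8568      Path Group: sclk ;djlasfhv slack (VIOLATED)       -0.88 Path Group: sclkasfhv slack (VIOLATED)                -0.121'

greedy = re.compile(r'Path Group: sclk.*VIOLATED\)\s*(-[0-9]\.[0-9]+)')
lazy = re.compile(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)')

# group(0) is the whole text consumed by each match, not just the capture.
greedy_matches = [m.group(0) for m in greedy.finditer(string)]
lazy_matches = [m.group(0) for m in lazy.finditer(string)]

print(len(greedy_matches))          # number of greedy matches
print(greedy_matches[0] == string)  # does the greedy match span the whole sample?
print(len(lazy_matches))            # number of lazy matches
```

With this sample the greedy pattern produces one match covering the entire string, so there is nothing left for findall to find after it, while the lazy pattern stops at the first VIOLATED) each time and yields three separate matches.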

Upvotes: 2
