Reputation: 141
I need to parse a large text file looking for patterns like this:
string='Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED) -0.8568 Path Group: sclk ;djlasfhv slack (VIOLATED) -0.88 Path Group: sclkasfhv slack (VIOLATED) -0.121'
violation = re.findall('Path Group: sclk.*VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)
This returns only the last value, -0.121. I am expecting [-0.8568, -0.88, -0.121].
P.S. This is just a sample string. The actual file is huge. I am looking for all the negative numbers that follow the string in my regex.
What mistake am I making?
Upvotes: 1
Views: 602
Reputation: 626729
A better, more efficient regex than the lazy-matching one is:
violation = re.findall(r'Path Group: sclk[^V]*(?:V(?!IOLATED\))[^V]*)*VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)
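As a quick sanity check, the sketch below runs the unrolled pattern and the lazy `.*?` pattern against the sample string from the question and compares the results. The names `sample`, `lazy` and `unrolled` are only illustrative, and the check covers this one string, not every possible input.

```python
import re

# Sample string from the question, split across lines for readability.
sample = ('Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED) -0.8568 '
          'Path Group: sclk ;djlasfhv slack (VIOLATED) -0.88 '
          'Path Group: sclkasfhv slack (VIOLATED) -0.121')

# Lazy version: .*? expands one character at a time until VIOLATED) matches.
lazy = re.compile(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)')

# Unrolled version: consume runs of non-V characters, and step over a V
# only when it does not start "VIOLATED)".
unrolled = re.compile(
    r'Path Group: sclk[^V]*(?:V(?!IOLATED\))[^V]*)*VIOLATED\)\s*(-[0-9]\.[0-9]+)'
)

lazy_hits = lazy.findall(sample)
unrolled_hits = unrolled.findall(sample)

print(unrolled_hits)               # ['-0.8568', '-0.88', '-0.121']
print(lazy_hits == unrolled_hits)  # True
```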
It is based on Jeffrey E. F. Friedl's "unrolling the loop" method. It matches the same text as the version with lazy matching, but is more efficient:
`Path Group: sclk` - a literal sequence of characters
`[^V]*(?:V(?!IOLATED\))[^V]*)*VIOLATED\)` - anything up to the first `VIOLATED)`
`\s*` - 0 or more whitespace characters
`(-[0-9]\.[0-9]+)` - captures into Group 1 a `-`, followed by a digit, then a period, and then 1 or more digits.
Upvotes: 2
Reputation: 174696
Make it non-greedy. The `.*` in your regex is greedy, so it matches as many characters as possible. It therefore runs all the way to the last `VIOLATED)` in the string, and the `-[0-9]\.[0-9]+` pattern captures only the last number.
violation = re.findall(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)
                                           ^
Example:
>>> import re
>>> string='Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED) -0.8568 Path Group: sclk ;djlasfhv slack (VIOLATED) -0.88 Path Group: sclkasfhv slack (VIOLATED) -0.121'
>>> re.findall(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)', string)
['-0.8568', '-0.88', '-0.121']
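Since the real input is a large file, one option is to scan it line by line instead of reading it all into one string. The following is a minimal sketch, assuming each `Path Group: sclk` entry and its slack value sit on the same line. `find_violations` and the file name `timing.rpt` are hypothetical names used only for illustration.

```python
import re

# Same non-greedy pattern as above, compiled once for reuse.
PATTERN = re.compile(r'Path Group: sclk.*?VIOLATED\)\s*(-[0-9]\.[0-9]+)')

def find_violations(lines):
    """Yield every captured slack value as a float, one line at a time.

    Assumes a match never spans a line break.
    """
    for line in lines:
        for match in PATTERN.finditer(line):
            yield float(match.group(1))

# With a real file (hypothetical name), the file object is the line iterator:
#     with open('timing.rpt') as report:
#         violations = list(find_violations(report))

# Small in-memory demo built from the sample string in the question.
demo = [
    'Path Group: sclk ;djlasfhv slack 5t45545 545 (VIOLATED) -0.8568\n',
    'Path Group: sclk ;djlasfhv slack (VIOLATED) -0.88 '
    'Path Group: sclkasfhv slack (VIOLATED) -0.121\n',
]
print(list(find_violations(demo)))  # [-0.8568, -0.88, -0.121]
```

Converting with `float` gives the numeric list the question asks for. If entries can span several lines in the real file, this line-by-line approach would need to change.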
Upvotes: 2