Reputation: 121
So I have the following...
temp = 'item 8 but i want this item 8 and Financial Statements and Supplementary Data'
pattern_8 = r'ITEM 8.*?Financial Statements and Supplementary Data'
Then I do...
re.search(pattern_8,temp,re.IGNORECASE)
<re.Match object; span=(0, 77), match='item 8 but i want this item 8 and Financial State>
But at least for me it takes the first 'item 8' rather than the second. I guess I could loop the search over itself until it stops, but there has to be a reason this non-greedy matching isn't working?
Upvotes: 0
Views: 55
Reputation: 27360
Your result is to be expected. I think you misunderstood what non-greedy means. It does not mean »make the whole regex match the shortest string«, but just that the '.' after 'item 8' is matched as few times as possible until you encounter 'Financial ...'. This ensures that you pick the first 'Financial ...', but does not ensure that you pick the last 'item 8'.
The starting point of the search for 'Financial ...' is unaffected by the '?' modifier. You could say 'item 8' is greedy: the engine scans from left to right, so it will match the first 'item 8' in your string as long as there is a 'Financial ...' after that.
To get the shortest match, you can ensure that 'item 8' never occurs inside the matched part of '.*?':
item 8((?!item 8).)*?Financial Statements and Supplementary Data
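As a minimal sketch of how this pattern behaves on the string from the question, the snippet below runs it through the standard re module. The variable names tempered and m are only illustrative; the case-insensitive flag is carried over from the question.

```python
import re

temp = 'item 8 but i want this item 8 and Financial Statements and Supplementary Data'

# Tempered pattern: every character consumed by the lazy group is first
# checked with (?!item 8), so the match can never run across a later 'item 8'.
tempered = r'item 8((?!item 8).)*?Financial Statements and Supplementary Data'

m = re.search(tempered, temp, re.IGNORECASE)
print(m.span())    # (23, 77)
print(m.group(0))  # item 8 and Financial Statements and Supplementary Data
```

The attempt starting at index 0 fails because the lazy group is not allowed to step onto the second 'item 8', so the engine moves on and succeeds from index 23.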
Upvotes: 1
Reputation: 121
The third-party regex package (not the standard re module) has an overlapped option, so I can do this...
import regex as re
re.findall(pattern_8, temp, re.IGNORECASE, overlapped=True)
[(m.start(0), m.end(0)) for m in re.finditer(pattern_8, temp,re.IGNORECASE, overlapped=True)]
Out[161]: [(0, 77), (23, 77)]
Using the overlapped option gives me both matches very quickly.
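If installing regex is not an option, a similar list of spans can probably be had from the standard re module by wrapping the pattern in a capturing lookahead. This is only a sketch of that alternative, not what the overlapped option does internally; the names lookahead and spans are made up for the example.

```python
import re

temp = 'item 8 but i want this item 8 and Financial Statements and Supplementary Data'
pattern_8 = r'ITEM 8.*?Financial Statements and Supplementary Data'

# A lookahead consumes no characters, so finditer tries every start position;
# group 1 inside the lookahead captures the text that would have matched there.
lookahead = re.compile('(?=(' + pattern_8 + '))', re.IGNORECASE)

spans = [m.span(1) for m in lookahead.finditer(temp)]
print(spans)  # [(0, 77), (23, 77)]
```

The last entry is the shortest candidate, since it starts at the final 'item 8' that is still followed by the closing phrase.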
Upvotes: 0