ben shapiro

Reputation: 121

non-greedy regex

So I have the following:

temp = 'item 8 but i want this item 8 and Financial Statements and Supplementary Data'
pattern_8  = r'ITEM 8.*?Financial Statements and Supplementary Data'

Then I do...

re.search(pattern_8,temp,re.IGNORECASE)
<re.Match object; span=(0, 77), match='item 8 but i want this item 8 and Financial State>

But at least for me, the match starts at the first 'item 8' rather than the second. I guess I could loop the search over itself until it stops, but there has to be a reason this non-greedy matching isn't working?

Upvotes: 0

Views: 55

Answers (2)

Socowi

Reputation: 27360

Your result is to be expected. I think you misunderstood what non-greedy means. It does not mean »make the whole regex match the shortest string«, but just that the . after item 8 is matched as few times as possible until you encounter Financial .... This ensures that you pick the first Financial ..., but does not ensure that you pick the last item 8.

The starting point of the search for Financial ... is unaffected by the ? modifier. The regex engine tries start positions from left to right, so you could say item 8 is greedy: it will match the first item 8 in your string as long as there is a Financial ... somewhere after it.
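To make the difference concrete, here is a minimal sketch on a made-up string (the text and the END marker are hypothetical, not taken from the question). Lazy and greedy quantifiers change only where the match ends; both matches start at the leftmost item 8.

```python
import re

# Hypothetical sample with two start markers and two end markers.
text = 'item 8 aaa item 8 bbb END ccc END'

lazy = re.search(r'item 8.*?END', text)    # stops at the first END
greedy = re.search(r'item 8.*END', text)   # runs on to the last END

print(lazy.span())    # (0, 25)
print(greedy.span())  # (0, 33)
```

Both spans start at 0, which is why the ? alone cannot select the later item 8.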

To get the shortest match, you can ensure that item 8 never occurs inside the matched part of .*?.

item 8((?!item 8).)*?Financial Statements and Supplementary Data
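As a quick check, the pattern above can be run against the string from the question with the standard re module. This is only a sketch; the variable names tail, lazy and tempered are chosen here for illustration.

```python
import re

temp = 'item 8 but i want this item 8 and Financial Statements and Supplementary Data'
tail = r'Financial Statements and Supplementary Data'

# Original pattern: lazy dot, but the match still starts at the first "item 8".
lazy = re.compile(r'item 8.*?' + tail, re.IGNORECASE)

# Each consumed character must not be the start of another "item 8".
tempered = re.compile(r'item 8((?!item 8).)*?' + tail, re.IGNORECASE)

lazy_match = lazy.search(temp)
tempered_match = tempered.search(temp)

print(lazy_match.span())      # (0, 77)
print(tempered_match.span())  # (23, 77)
```

The negative lookahead makes the attempt that starts at the first item 8 fail once it reaches the second one, so the engine moves on and succeeds from position 23.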

Upvotes: 1

ben shapiro

Reputation: 121

The third-party regex package (not the standard-library re) has an overlapped option, so I can do this...

import regex as re
re.findall(pattern_8, temp, re.IGNORECASE, overlapped=True)
[(m.start(0), m.end(0)) for m in re.finditer(pattern_8, temp,re.IGNORECASE, overlapped=True)]
Out[161]: [(0, 77), (23, 77)]

Using the overlapped option gives me both matches very quickly.
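For anyone who cannot install regex, a similar result seems achievable with the standard-library re alone by wrapping the pattern in a capturing lookahead. This is a different technique from the overlapped option, sketched here under the assumption that only the spans are needed:

```python
import re

temp = 'item 8 but i want this item 8 and Financial Statements and Supplementary Data'
pattern_8 = r'ITEM 8.*?Financial Statements and Supplementary Data'

# The lookahead consumes nothing, so a match can be reported at every
# position where the pattern could start; group 1 holds the matched text.
overlapping = re.compile(r'(?=(' + pattern_8 + r'))', re.IGNORECASE)

spans = [m.span(1) for m in overlapping.finditer(temp)]
print(spans)  # [(0, 77), (23, 77)]
```

The last span in the list is the shortest match, the one starting at the second item 8.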

Upvotes: 0
