Schemer

Reputation: 1675

Python regex: greedy pattern returning multiple empty matches

This pattern is meant simply to grab everything in a string up until the first potential sentence boundary in the data:

[^\.?!\r\n]*

Output:

>>> import re
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print(matches)
['Australians go hard', '', '', '', '']

From the Python documentation:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern is producing an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string; I just don't see how it would be doing that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.

Adding the ^ anchor has this effect:

>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print(matches)
['Australians go hard']

Since this eliminates the empty string matches, it would seem to indicate that said empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation with respect to the matches being returned in the order found (matches before the leading "A" should have been first) and, again, exactly four empty matches baffles me.

Upvotes: 4

Views: 2112

Answers (1)

rchang

Reputation: 5246

The * quantifier allows the pattern to capture a substring of length zero. In your original code version (without the ^ anchor in front), the additional matches are:

  • the zero-length string between the end of hard and the first !
  • the zero-length string between the first and second !
  • the zero-length string between the second and third !
  • the zero-length string between the third ! and the end of the text
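One way to see those four positions for yourself is finditer, which reports where each match starts and ends. This is only a minimal sketch: the variable names are mine, and it assumes the same pattern and input as in your question (the text is 22 characters long, with the ! characters at indexes 19 through 21).

```python
import re

text = "Australians go hard!!!"
pattern = re.compile(r"([^\.?!\r\n]*)")

# Collect the (start, end) offsets of every match, including the empty ones.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)
# [(0, 19), (19, 19), (20, 20), (21, 21), (22, 22)]
```

Each empty match sits at a different offset, so none of them repeats the same position: the scan keeps moving rightward one character at a time after each zero-length match.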

You can slice/dice this further if you like here.

Adding the ^ anchor to the front ensures that only a single substring can match the pattern, since (without the re.MULTILINE flag) the beginning of the input text occurs exactly once.
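As a rough illustration (again, the names here are mine, not from your code), the sketch below compares the anchored findall with calling match on the unanchored pattern. Since match only ever attempts a match at the start of the string, it should give the same single result without needing the ^.

```python
import re

text = "Australians go hard!!!"

# Anchored: ^ can only match at index 0 (no re.MULTILINE), so one match.
anchored = re.compile(r"^([^\.?!\r\n]*)")
print(anchored.findall(text))   # ['Australians go hard']

# Unanchored pattern, but match() only attempts a match at the start.
unanchored = re.compile(r"([^\.?!\r\n]*)")
first = unanchored.match(text)
print(first.group(1))           # Australians go hard
```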

Upvotes: 6
