Mika H.

Reputation: 4329

Python regex with *?

What does this Python regex match?

.*?[^\\]\n

I'm confused about why the . is followed by both * and ?.

Upvotes: 4

Views: 2416

Answers (3)

Andrew Clark

Reputation: 208455

* means "match the previous element as many times as possible (zero or more times)".

*? means "match the previous element as few times as possible (zero or more times)".

The other answers already address this, but what they don't bring up is how it changes the regex. If the re.DOTALL flag is provided, it makes a huge difference, because . will then match line break characters. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
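
To make the re.DOTALL case concrete, here is a minimal sketch. The sample string s is made up for illustration (two ordinary lines, then a line ending in a backslash); only the two patterns come from the question.

```python
import re

# Hypothetical sample text: "one", "two", then a line that ends in a backslash.
s = "one\ntwo\nthree\\\n"

# With re.DOTALL, . also matches newlines, so the greedy form runs to the
# last newline that is not preceded by a backslash.
greedy = re.search(r'.*[^\\]\n', s, re.DOTALL)

# The lazy form stops at the first such newline.
lazy = re.search(r'.*?[^\\]\n', s, re.DOTALL)

print(repr(greedy.group()))  # 'one\ntwo\n'
print(repr(lazy.group()))    # 'one\n'
```

The greedy match spans two lines, while the lazy match stays on the first one.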

If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match anything other than a backslash, including line break characters. Consider the following example:

>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']

So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.

This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
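
If you want to check that breakdown yourself, one way is to wrap each piece of the pattern in a capture group. The groups are added here purely for illustration and are not part of the original regex:

```python
import re

s = "foo\n\nbar"

# Same patterns as in the example, with each piece wrapped in a group
# so we can see which part of the string it consumed.
lazy = re.search(r'(.*?)([^\\])(\n)', s)
greedy = re.search(r'(.*)([^\\])(\n)', s)

print(lazy.groups())    # ('fo', 'o', '\n')
print(greedy.groups())  # ('foo', '\n', '\n')
```

In the greedy case the [^\\] group holds a newline, which is where the extra \n in the result comes from.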

Upvotes: 6

Michael

Reputation: 3332

. indicates a wild card. It can match any character except a \n, unless the re.DOTALL flag is used.

* indicates that you can have 0 or more of the thing preceding it.

? indicates that the preceding quantifier is lazy: it matches as few repetitions as possible, growing only as far as needed for the rest of the pattern to match.
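
A small sketch of how these three pieces interact. The string "aaab" and the variable names are just made up for the demonstration:

```python
import re

text = "aaab"  # hypothetical input

# Greedy: a* takes every a it can.
alone_greedy = re.match(r'a*', text).group()

# Lazy with nothing after it: zero repetitions is already a valid match.
alone_lazy = re.match(r'a*?', text).group()

# Lazy followed by b: a*? has to expand until the b can match.
forced_lazy = re.match(r'a*?b', text).group()

print(repr(alone_greedy))  # 'aaa'
print(repr(alone_lazy))    # ''
print(repr(forced_lazy))   # 'aaab'
```

So a lazy quantifier on its own can match nothing at all; what it ends up consuming depends on what follows it in the pattern.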

Upvotes: 6

andrewdotn

Reputation: 34813

Opening the Python re module documentation, and searching for *?, we find:

*?, +?, ??:

The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
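
The example from the documentation can be reproduced directly. The variable names below are arbitrary:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last > in the string.
greedy = re.match(r'<.*>', html).group()

# Non-greedy: .*? stops at the first > it can.
lazy = re.match(r'<.*?>', html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```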

Upvotes: 6
