So8res

Reputation: 10386

Python non-greedy regexes

How do I make a Python regex like "(.*)" such that, given "a (b) c (d) e", Python matches "b" instead of "b) c (d"?

I know that I can use "[^)]" instead of ".", but I'm looking for a more general solution that keeps my regex a little cleaner. Is there any way to tell Python "hey, match this as soon as possible"?

Upvotes: 291

Views: 187665

Answers (8)

ojrac

Reputation: 13421

Using a non-greedy match is a good start, but I'd also suggest that you reconsider any use of .* at all. What about this?

groups = re.search(r"\([^)]*\)", x)
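For comparison, here is a minimal sketch that runs the negated character class and the lazy quantifier side by side on the question's input. The variable names (`text`, `negated`, `lazy`) are only illustrative, and capture groups are added so that `re.findall` returns just the inner text.

```python
import re

text = "a (b) c (d) e"

# Negated character class: [^)]* can never cross a closing parenthesis,
# so the match ends at the first ")" without relying on laziness.
negated = re.findall(r"\(([^)]*)\)", text)

# Lazy quantifier: .*? grows one character at a time until ")" can match.
lazy = re.findall(r"\((.*?)\)", text)

print(negated)  # ['b', 'd']
print(lazy)     # ['b', 'd']
```

On this input both give the same result. One difference worth knowing: `.` does not match a newline unless `re.DOTALL` is set, while `[^)]` does.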

Upvotes: 10

In Hoc Signo

Reputation: 495

To start with, I do not suggest using "*" in regexes. Yes, I know, it is the most commonly used quantifier, but it is nevertheless often a bad idea. While it matches any number of repetitions of the preceding element, "any" includes 0, which is usually input you want to reject, not accept. Instead, I suggest using +, which matches one or more repetitions. What's more, from what I can see, you are dealing with fixed-length parenthesized expressions. If so, you can probably use the {m,n} syntax (with no space after the comma) to specify the desired length range.

However, if you really do need non-greedy repetition, I suggest consulting the all-powerful ?. When placed directly after any repetition quantifier, it makes that part of the regex match as little text as possible.

That being said, I would be very careful with the ? as it, like the Sonic Screwdriver in Doctor Who, has a tendency to do, how should I put it, "slightly" undesired things if not carefully calibrated. For example, given nested input such as ((1)), it would identify ((1) (note the lack of a second rparen) as a match.
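A short sketch of the points above, assuming the nested input `((1))` and the question's input as test strings. The bounds in `{1,3}` are arbitrary and chosen only for illustration.

```python
import re

nested = "((1))"
# The lazy match starts at the first "(" and stops at the first ")",
# so the closing parenthesis of the outer pair is left out.
nested_match = re.search(r"\(.*?\)", nested).group()
print(nested_match)  # ((1)

text = "a (b) c (d) e"
# +? requires at least one character between the parentheses.
plus_matches = re.findall(r"\((.+?)\)", text)
print(plus_matches)  # ['b', 'd']

# Bounded lazy repetition: between 1 and 3 characters, as few as possible.
bounded_matches = re.findall(r"\((.{1,3}?)\)", text)
print(bounded_matches)  # ['b', 'd']
```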

Upvotes: 0

Upzilla

Reputation: 21

You can modify your regex pattern to use a non-greedy quantifier. Instead of (.*), you can use (.*?).

Here's an explanation:

  1. * is a greedy quantifier: .* matches as much as possible (including parentheses in your case), so the match runs up to the last occurrence of ).

  2. *? is the non-greedy (or lazy) version of *: .*? matches as little as possible while still allowing the overall pattern to match. It stops as soon as the subsequent part of the regex pattern can match.

Therefore, your regex pattern can be adjusted to (.*?) like this:

import re

input_string = "a (b) c (d) e"
pattern = r'\((.*?)\)'
matches = re.findall(pattern, input_string)

print(matches)  # Output: ['b', 'd']

In this modified pattern r'\((.*?)\)', we're matching substrings inside parentheses () in a non-greedy way. The .*? part ensures that the regex engine stops capturing characters as soon as it encounters the first closing parenthesis ), thus giving you the desired result of matching only the content inside each pair of parentheses.

Upvotes: 2

Trey Stout

Reputation: 6911

You seek the all-powerful *?

From the docs, Greedy versus Non-Greedy

the non-greedy qualifiers *?, +?, ??, or {m,n}? [...] match as little text as possible.
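To make the quoted sentence concrete, here is a small sketch that runs each greedy quantifier next to its non-greedy counterpart. The test string `"aaaa"` and the `results` dictionary are assumptions made for illustration.

```python
import re

text = "aaaa"

# Each pair is (greedy pattern, non-greedy pattern).
pairs = [
    ("a*", "a*?"),          # 'aaaa' vs ''
    ("a+", "a+?"),          # 'aaaa' vs 'a'
    ("a?", "a??"),          # 'a'    vs ''
    ("a{2,4}", "a{2,4}?"),  # 'aaaa' vs 'aa'
]

results = {}
for greedy, lazy in pairs:
    # re.match anchors at the start of the string, so only the
    # quantifier decides how much text is consumed.
    results[greedy] = re.match(greedy, text).group()
    results[lazy] = re.match(lazy, text).group()
    print(greedy, repr(results[greedy]), "|", lazy, repr(results[lazy]))
```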

Upvotes: 416

Zitrax

Reputation: 20334

Would not "\\(.*?\\)" work (or r"\(.*?\)" as a raw string)? That is the non-greedy syntax.

Upvotes: 18

Chas. Owens

Reputation: 64939

As the others have said, using the ? modifier on the * quantifier will solve your immediate problem. But be careful: you are starting to stray into areas where regexes stop working and you need a parser instead. For instance, the string "(foo (bar)) baz" will cause you problems.
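Here is a rough sketch of what goes wrong with that string, followed by one possible hand-written alternative. The function `outer_groups` is a hypothetical helper, not something from the `re` module; it simply counts parenthesis depth and assumes unmatched closing parentheses should be ignored.

```python
import re

text = "(foo (bar)) baz"

# The lazy pattern stops at the first ")", cutting the outer group short.
lazy_match = re.search(r"\(.*?\)", text).group()
print(lazy_match)  # (foo (bar)

def outer_groups(s):
    """Return the contents of each top-level parenthesized group in s."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(s):
        if ch == "(":
            if depth == 0:
                start = i + 1  # content begins after the opening paren
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(s[start:i])
    return groups

print(outer_groups(text))             # ['foo (bar)']
print(outer_groups("a (b) c (d) e"))  # ['b', 'd']
```

The greedy pattern happens to work on this particular string, but it fails again as soon as the input contains two separate groups, which is why a depth counter (or a real parser) is the safer choice for nested input.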

Upvotes: 6

David Berger

Reputation: 12823

Do you want it to match "(b)"? Do as Zitrax and Paolo have suggested. Do you want it to match "b"? Do:

>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\((.*?)\)", x).group(1)
'b'

Upvotes: 7

Paolo Bergantino

Reputation: 488704

>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\(.*\)", x).group()
'(b) c (d)'
>>> re.search(r"\(.*?\)", x).group()
'(b)'

According to the docs:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
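The example from the quoted passage can be checked directly. This is a minimal sketch; the variable names are illustrative.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? stops at the first ">" that lets the pattern match.
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```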

Upvotes: 95
