Shuai Zhang

Reputation: 2061

Non-greedy mode in Python re module

For some reason, I have to use non-greedy matching with the regex module in Python. Here is the code (which might look a bit weird to you):

import re
# the string
s = u"<n-0><LBRACKET-1><eng-2><RBRACKET-3><n-4><n-5><v-6><n-7><m-8><GRAM-9><PU-10><n-11><LBRACKET-12><n-13><n-14><RBRACKET-15><m-16><GRAM-17>"
# the pattern
p = ur"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>.*?<RBRACKET-\d+>)?)(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)"
tmp = re.search(p, s).group()

The result is the whole string s, but I want two matches: <n-7><m-8><GRAM-9> and <n-11><LBRACKET-12><n-13><n-14><RBRACKET-15><m-16><GRAM-17>.

I think it has something to do with the non-greedy mode of the regex. Could anybody point out where I am going wrong?

Upvotes: 3

Views: 236

Answers (1)

gog

Reputation: 11347

I think this is what you're looking for:

p = ur"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>((?<!RBRACKET).)*?<RBRACKET-\d+>)?)(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)"

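Explanation: the problem with your original pattern is the catch-all fragment .*? between LBRACKET and RBRACKET. Yes, it's non-greedy, but laziness only matters when the engine can choose between several ways to complete a match from the same starting position. It does not make the engine prefer a match that starts later in the string. The engine scans from left to right, so it starts at <n-0>, and from there the bracket group can only be completed by an RBRACKET that is followed by <m...>. There is only one of those, <RBRACKET-15>. So .*? expands past <RBRACKET-3> (which is followed by <n-4>, not <m...>) until it reaches <RBRACKET-15>, and the match runs from <n-0> through <GRAM-17>, i.e. the whole string. That is a valid match, and the shortest one available from that starting position, so the engine doesn't look any further. The negative lookbehind (?<!RBRACKET) refuses to consume any character that comes directly after the text RBRACKET, so the lazy part can no longer run past the first RBRACKET tag it meets. The attempt at <n-0> then fails, and the engine is forced to try later starting positions.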
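
To check the behaviour described above, here is a small sketch that runs both patterns against the string from the question. It is written for Python 3, so it uses plain r"..." literals (the ur prefix is only accepted by Python 2), and the variable names original, fixed, first_original and all_fixed are mine, not from the question. It also assumes you want every match rather than just the first one, which is why it uses re.finditer instead of re.search.

```python
import re

# The string from the question, split over two lines for readability.
s = ("<n-0><LBRACKET-1><eng-2><RBRACKET-3><n-4><n-5><v-6><n-7><m-8><GRAM-9>"
     "<PU-10><n-11><LBRACKET-12><n-13><n-14><RBRACKET-15><m-16><GRAM-17>")

# Original pattern: the lazy .*? is still allowed to run past <RBRACKET-3>.
original = (r"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>.*?<RBRACKET-\d+>)?)"
            r"(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)")

# Pattern from this answer: the lookbehind stops the lazy part
# right after the first "RBRACKET" it reaches.
fixed = (r"(?P<name>(?:<n-\d+>)+"
         r"(<LBRACKET-\d+>((?<!RBRACKET).)*?<RBRACKET-\d+>)?)"
         r"(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)")

# re.search returns only the first (leftmost) match.
first_original = re.search(original, s).group()
print(first_original == s)  # True: the original pattern swallows the whole string

# re.finditer yields every non-overlapping match, left to right.
all_fixed = [m.group() for m in re.finditer(fixed, s)]
for match_text in all_fixed:
    print(match_text)
```

Note that re.search(p, s).group() on its own would only give you <n-7><m-8><GRAM-9> even with the fixed pattern. If you need the individual parts, each match object from re.finditer also exposes the named groups, e.g. m.group("amount").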

Upvotes: 1
