Reputation: 2061
For some reason, I have to use the non-greedy mode of the regex in Python. Here is the code (which might look a bit weird to you):
import re
# the string
s = u"<n-0><LBRACKET-1><eng-2><RBRACKET-3><n-4><n-5><v-6><n-7><m-8><GRAM-9><PU-10><n-11><LBRACKET-12><n-13><n-14><RBRACKET-15><m-16><GRAM-17>"
# the pattern
p = ur"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>.*?<RBRACKET-\d+>)?)(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)"
tmp = re.search(p, s).group()
The result is the whole string s, but I want the result to be <n-7><m-8><GRAM-9> and <n-11><LBRACKET-12><n-13><n-14><RBRACKET-15><m-16><GRAM-17>.
I think it has something to do with the non-greedy mode of regex. Could anybody point out where I am going wrong?
Upvotes: 3
Views: 236
Reputation: 11347
I think this is what you're looking for:
p = ur"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>((?<!RBRACKET).)*?<RBRACKET-\d+>)?)(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)"
Explanation: the problem with your original pattern is the catch-all fragment .*? between LBRACKET and RBRACKET. Yes, it's non-greedy, but laziness only matters when the engine has a choice between two or more ways to match from the same starting position. The engine always returns the leftmost match, and starting at <n-0> there is no choice, because there is only one RBRACKET followed by <m...>. Therefore, it matches <n-0><LBRACKET-1>...<RBRACKET-15> plus the trailing <m-16><GRAM-17> and doesn't look any further, because that is a valid (and the shortest) match from that position. By adding a negative lookbehind, we explicitly tell the engine that the .*? shouldn't run past an RBRACKET, which makes the attempt at <n-0> fail and forces the engine to try later starting positions.
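To check the behaviour described above, here is a minimal sketch that runs both patterns against the string from the question. It assumes Python 3, so the patterns use r"..." instead of the ur"..." prefix (which Python 3 rejects), and the variable names original, fixed, original_match and fixed_matches are only illustrative. Since re.search returns just the first match, the sketch uses re.finditer to collect both of the wanted results.

```python
import re

# The string from the question, split across two literals for readability.
s = ("<n-0><LBRACKET-1><eng-2><RBRACKET-3><n-4><n-5><v-6><n-7><m-8><GRAM-9>"
     "<PU-10><n-11><LBRACKET-12><n-13><n-14><RBRACKET-15><m-16><GRAM-17>")

# Original pattern: the lazy .*? is still allowed to run across an RBRACKET tag.
original = re.compile(
    r"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>.*?<RBRACKET-\d+>)?)"
    r"(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)"
)

# Fixed pattern: a character between the brackets is only consumed if it is
# not immediately preceded by the text "RBRACKET".
fixed = re.compile(
    r"(?P<name>(?:<n-\d+>)+(<LBRACKET-\d+>((?<!RBRACKET).)*?<RBRACKET-\d+>)?)"
    r"(?P<amount><m-\d+>)(?P<measure><GRAM-\d+>)"
)

# The leftmost match of the original pattern starts at <n-0>.
original_match = original.search(s).group()

# finditer yields every non-overlapping match, scanning left to right.
fixed_matches = [m.group() for m in fixed.finditer(s)]

print(original_match == s)
for match in fixed_matches:
    print(match)
```

With the fixed pattern, the attempt starting at <n-0> fails, so the engine moves on and the two matches should be the ones asked for in the question.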
Upvotes: 1