Reputation: 4180
I am trying to use a regular expression to find specific parenthesized substrings in a string like the one below:
foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'
Specifically, I want to find only (peach W/O juice), (pear W/O water), and (pineapple W/O salt).
I tried lookahead and lookbehind, but was unable to obtain the correct results.
For example, when I use the following regex:
import re
regex = r'(?<=[\s\(])\([^\)].*\sW/O\s[^\)].*\)(?=[\)\s])'
re.findall(regex, foo)
I end up with the entire string:
['(peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt)']
I found the problem: instead of [^\)].* I should use [^\)]* so that the match cannot run past a closing parenthesis. That gives me the correct result:
regex = r'(?<=[\s\(])\([^\)]*\sW/O\s[^\)]*\)(?=[\)\s])'
re.findall(regex, foo)
['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
Upvotes: 1
Views: 126
Reputation: 89614
Since you want to include the parentheses in the result, you don't need lookarounds. You can use a negated character class that excludes parentheses. That way, you can be sure that W/O sits inside a single pair of parentheses:
re.findall(r'\([^()]* W/O [^)]*\)', foo)
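As a quick check, here is a minimal, self-contained sketch of that pattern run against the string from the question. The variable names pattern and matches are only illustrative.

```python
import re

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

# [^()]* cannot cross a parenthesis, so " W/O " has to sit inside one group.
pattern = r'\([^()]* W/O [^)]*\)'
matches = re.findall(pattern, foo)

print(matches)
# ['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
```

The outer opening parenthesis is skipped because it is immediately followed by another "(", which the character class refuses to match.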
Upvotes: 1
Reputation: 9633
I think your problem is that your .* operators are being greedy - they will consume as much as they can unless you put a ? after them (.*?). Also, note that since you want the parentheses, you shouldn't need the lookahead/lookbehind operations; they will exclude the parentheses they find.
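To see the greedy versus non-greedy difference in isolation, here is a small sketch on a made-up string (sample is a hypothetical input, not the one from the question):

```python
import re

sample = '(a) OR (b)'  # hypothetical input, chosen only to keep the output short

# Greedy: .* runs to the end and backtracks to the LAST ')'.
greedy = re.findall(r'\(.*\)', sample)

# Non-greedy: .*? grows one character at a time and stops at the FIRST ')'.
lazy = re.findall(r'\(.*?\)', sample)

print(greedy)  # ['(a) OR (b)']
print(lazy)    # ['(a)', '(b)']
```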
Instead of fully debugging your regex, I decided to just rewrite it:
>>> import re
>>> foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'
>>> regex = r'\([a-zA-Z ]*?W/O.*?\)'
>>> re.findall(regex, foo)
['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
Here's the breakdown:
\(
matches the leading parenthesis (note that it's escaped)
[a-zA-Z ]
matches any letter or a space (note the space after Z, before the closing bracket). I used this instead of . so that no other parentheses can be matched. Using the period operator would cause (lychee AND sugar) OR (pineapple W/O salt) to be captured as one match.
*?
the * lets the characters in the bracket match 0 or more times, and the ? says to consume only as many as needed to make a match
W/O
matches the literal "W/O" that you're looking for
.*?
matches any further characters (again, non-greedy because of the ?)
\)
matches the trailing parenthesis
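The point about the period operator can be checked directly. The sketch below runs both variants against the string from the question; with_dot and with_class are illustrative names.

```python
import re

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

# With '.', the text before W/O may cross parentheses even when non-greedy.
with_dot = re.findall(r'\(.*?W/O.*?\)', foo)

# With [a-zA-Z ], the text before W/O is limited to letters and spaces.
with_class = re.findall(r'\([a-zA-Z ]*?W/O.*?\)', foo)

print(with_dot)
# ['((peach W/O juice)', '(pear W/O water)', '(lychee AND sugar) OR (pineapple W/O salt)']
print(with_class)
# ['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
```

Non-greedy matching alone is not enough here: .*? still starts at the first "(" it finds and keeps expanding until it reaches a W/O, so restricting the allowed characters is what keeps each match inside one group.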
Upvotes: 3