Alex

Reputation: 4180

Using regular expression to find specific strings between parentheses (including parentheses)

I am trying to use a regular expression to find specific parenthesized substrings in a string like the one below:

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

Specifically, I want to find only (peach W/O juice), (pear W/O water), and (pineapple W/O salt).

I tried lookahead and lookbehind, but was unable to obtain the correct results.

For example, when I run the following regex:

import re
regex = r'(?<=[\s\(])\([^\)].*\sW/O\s[^\)].*\)(?=[\)\s])'
re.findall(regex, foo)

I end up with almost the entire string (everything except the outermost parentheses) as a single match:

['(peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt)']

EDIT:

I found the problem:

Instead of [^\)].*, I should use [^\)]*, which gives me the correct result:

regex = r'(?<=[\s\(])\([^\)]*\sW/O\s[^\)]*\)(?=[\)\s])'

re.findall(regex, foo)
['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']

Upvotes: 1

Views: 126

Answers (2)

Casimir et Hippolyte

Reputation: 89614

Since you want to include the parentheses in the result, you don't need lookarounds. You can use a character class that excludes the closing parenthesis. That way, you can be sure that W/O sits between a single pair of parentheses:

re.findall(r'\([^()]* W/O [^)]*\)', foo)
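To make the behaviour easier to check, here is a minimal, self-contained sketch run against the string from the question. It is a slight variation, not the exact line above: it excludes both kinds of parenthesis on each side of W/O (an assumption that the groups are never nested any deeper), and the names pattern, matches and first_start are only illustrative.

```python
import re

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

# [^()]* can never step over a parenthesis, so every match is confined
# to one innermost pair of parentheses.
pattern = re.compile(r'\([^()]* W/O [^()]*\)')

matches = pattern.findall(foo)
print(matches)
# ['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']

# The outermost "(" at index 0 cannot start a match, because the very
# next character is another "(", which the character class rejects.
first_start = pattern.search(foo).start()
print(first_start)  # 1
```

The group (lychee AND sugar) is skipped because the class stops at its closing parenthesis before any W/O is found.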

Upvotes: 1

skrrgwasme

Reputation: 9633

I think your problem is that your .* quantifiers are greedy: they consume as much as they can unless you put a ? after them (.*?). Also, note that since you want the parentheses in the result, you shouldn't need the lookahead/lookbehind assertions; whatever a lookaround matches is left out of the result.
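As a rough illustration of the greedy/non-greedy difference, here is a small sketch using deliberately simplified patterns (not your original regex) on the string from the question. The names greedy and lazy are just for this example.

```python
import re

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

# Greedy: each .* grabs as much as possible, so the single match runs
# from the first "(" to the very last ")".
greedy = re.findall(r'\(.*W/O.*\)', foo)
print(greedy == [foo])  # True

# Non-greedy: each .*? stops as early as it can, but . still matches
# parentheses, so a match can start at the wrong "(" or span two groups.
lazy = re.findall(r'\(.*?W/O.*?\)', foo)
print(lazy)
# ['((peach W/O juice)', '(pear W/O water)',
#  '(lychee AND sugar) OR (pineapple W/O salt)']
```

So making the quantifiers non-greedy helps, but on its own it is not enough; the pattern also has to be stopped from crossing a parenthesis.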

Instead of fully debugging your regex, I decided to just rewrite it:

>>> import re
>>> foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'
>>> regex = r'\([a-zA-Z ]*?W/O.*?\)'
>>> re.findall(regex, foo)
['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']

Here's the breakdown:

\( matches the opening parenthesis (note that it's escaped)

[a-zA-Z ] matches any letter or a space (note the space after Z, before the closing bracket). I used this instead of . so that no other parentheses can be matched. Using . here would cause (lychee AND sugar) OR (pineapple W/O salt) to be captured as one match.

*? the * lets the characters in the bracket match 0 or more times, and the ? makes it non-greedy, so it matches only as many as are needed to make a match

W/O matches the literal "W/O" that you're looking for

.*? matches any further characters (again, non-greedy because of the ?)

\) matches the closing parenthesis
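The same breakdown can be written into the pattern itself with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is only a sketch of that idea. It is the same regex laid out differently, and the name pattern is illustrative.

```python
import re

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

pattern = re.compile(r"""
    \(            # a literal opening parenthesis
    [a-zA-Z ]*?   # letters and spaces, as few as possible
    W/O           # the literal text W/O
    .*?           # any characters, as few as possible
    \)            # a literal closing parenthesis
""", re.VERBOSE)

result = pattern.findall(foo)
print(result)
# ['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
```

The space inside [a-zA-Z ] is kept even in verbose mode, because whitespace inside a character class is not ignored.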

Upvotes: 3
