Reputation: 41
Given the string
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
I'd like to extract everything within the parentheses UNLESS the parens are inside a quotation. So far I've managed to get everything within parentheses, but I can't figure out how to stop from splitting on the inner parenthesis inside the quotes. My current code is:
import re
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
p = re.compile(r"\((.*?)\)")
m = p.findall(S)
for element in m:
    print(element)
What I want is:
45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699
What I currently get is:
45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation
45795362,-1,'!!_(disambiguation
What can I do in order to ignore the internal paren?
Thank you!!
In case it helps, here are the threads I've looked at:
1) REGEX-String and escaped quote
2) Regular expression to return text between parenthesis
3) Get the string within brackets in Python
Upvotes: 4
Views: 1951
Reputation: 174706
You could use the regex below.
>>> import re
>>> s = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
>>> for i in re.findall(r"\(((?:'[^']*'|[^()])*)\)", s):
...     print(i)
...
45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699
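For readability, the same pattern can also be spelled out with re.VERBOSE so that each piece carries its own comment. This is only a sketch of the regex shown above; the names RECORD and records are illustrative and not part of the original answer.

```python
import re

# Verbose spelling of the same pattern; RECORD is an illustrative name.
RECORD = re.compile(r"""
    \(                  # literal opening parenthesis
    (                   # start of the capturing group
      (?:
          '[^']*'       # a whole single-quoted block, parentheses included
        | [^()]         # or any single character that is not a parenthesis
      )*
    )                   # end of the capturing group
    \)                  # literal closing parenthesis
""", re.VERBOSE)

s = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

records = RECORD.findall(s)
for record in records:
    print(record)
```

The quoted-block alternative is listed first so that a ' always starts a quoted block when one is available; this assumes the quoted values never contain a ' themselves.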
Explanation:
\( - matches a literal ( symbol.
( - start of the capturing group.
(?:'[^']*'|[^()])* - the '[^']*' part greedily matches a single-quoted block. Any ( or ) inside that block is ignored, because [^']* matches any character except ', zero or more times. If the next character does not start a single-quoted block, control passes to the alternative after the | symbol, i.e. [^()], which matches any character except ( or ). So the whole (?:'[^']*'|[^()])* matches a single-quoted block or any character other than ( or ), zero or more times.
) - end of the capturing group.
\) - matches a literal ) symbol.
Upvotes: 1
Reputation: 70732
You can use a non-capturing group to assert either a comma or the end of the string follows:
p = re.compile(r'\((.*?)\)(?:,|$)')
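A quick check of this pattern against the string from the question, plus one made-up input to show where it could break. The string tricky is hypothetical: it assumes a quoted title that happens to contain "),", which the pattern treats as the end of a record.

```python
import re

p = re.compile(r'\((.*?)\)(?:,|$)')

S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
matches = p.findall(S)
for element in matches:
    print(element)

# Hypothetical input: the quoted value contains ")," so the lazy match
# stops inside the quotes and the first record is cut short.
tricky = "(1,-1,'a),b',2600),(2,-1,'c',2699)"
tricky_matches = p.findall(tricky)
print(tricky_matches)
```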
Upvotes: 3
Reputation: 3089
A simple approach would be a negative lookahead: check that no quote follows the closing parenthesis, e.g.
import re
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
m = re.findall(r'\((.*?)\)(?![\'])', S)
for element in m:
    print(element)
prints
45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699
http://www.codeskulptor.org/#user39_CL89xhroV0_0.py
I have put the quote in a character class (square brackets) so that you can add other symbols that should cause the closing parenthesis to be ignored.
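As a sketch of extending the character class, the variant below also adds a double quote. The input string is made up for illustration and assumes some titles are wrapped in double quotes instead of single quotes.

```python
import re

# Hypothetical data: the second title is wrapped in double quotes.
S = "(1,-1,'!!_(disambiguation)',2600),(2,-1,\"Foo_(bar)\",2699)"

# A closing parenthesis followed by either kind of quote is skipped.
pattern = re.compile(r'\((.*?)\)(?![\'"])')
matches = pattern.findall(S)
for element in matches:
    print(element)
```

Note that this only covers a closing parenthesis sitting directly before the closing quote. A parenthesis in the middle of a quoted value, such as 'a)b', would still end the match early.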
Upvotes: 0