MayaR

Reputation: 41

Python Regex get everything within parentheses unless in quotes

Given the string

S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

I'd like to extract everything within the parentheses UNLESS the parentheses are inside a quoted string. So far I've managed to get everything within parentheses, but I can't figure out how to stop it from splitting on the inner parenthesis inside the quotes. My current code is:

import re
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

# Raw string so the backslashes reach the regex engine unchanged
p = re.compile(r"\((.*?)\)")
m = p.findall(S)
for element in m:
    print(element)

What I want is:

45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699

What I currently get is:

45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation
45795362,-1,'!!_(disambiguation

What can I do in order to ignore the internal paren?

Thank you!!


In case it helps, here are the threads I've looked at:

1) REGEX-String and escaped quote

2) Regular expression to return text between parenthesis

3) Get the string within brackets in Python

Upvotes: 4

Views: 1951

Answers (4)

Avinash Raj

Reputation: 174706

You could use the regex below.

>>> import re
>>> s = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
>>> for i in re.findall(r"\(((?:'[^']*'|[^()])*)\)", s):
...     print(i)
...
45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699

Explanation:

  • \( - matches a literal ( character.
  • ( - starts the capturing group.
  • (?:'[^']*'|[^()])* - the '[^']*' alternative greedily matches a whole single-quoted block. Because [^']* accepts any character except ', any ( or ) inside the quotes is consumed as part of the block. If the next character does not start a single-quoted block, the regex falls through to the alternative after the | symbol, i.e. [^()], which matches any single character other than ( or ). The whole group therefore matches a single-quoted block or any non-parenthesis character, zero or more times.
  • ) - ends the capturing group.
  • \) - matches a literal ) character.
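
The pieces listed above can also be written out as a commented pattern using re.VERBOSE. This is only a sketch of the same regex; the names PATTERN and extract_groups are made up for illustration and are not part of the original answer.

```python
import re

# Same regex as in the answer, spread over several lines with re.VERBOSE
# so that each piece carries its own comment.
PATTERN = re.compile(r"""
    \(                # literal opening parenthesis
    (                 # start of the capturing group
      (?:
          '[^']*'     # a whole single-quoted block, parentheses included
        | [^()]       # or one character that is not a parenthesis
      )*              # repeated zero or more times
    )                 # end of the capturing group
    \)                # literal closing parenthesis
""", re.VERBOSE)


def extract_groups(text):
    """Return the contents of each top-level parenthesised group."""
    return PATTERN.findall(text)


s = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600)"
for group in extract_groups(s):
    print(group)
```

Note that this assumes quoted values never contain a single quote themselves; escaped quotes would need an extra alternative in the pattern.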


Upvotes: 1

mmachine

Reputation: 926

# Drop the outermost "(" and ")", then split on the "),(" between groups
for element in S[1:-1].split('),('):
    print(element)

Upvotes: 1

hwnd

Reputation: 70732

You can use a non-capturing group to assert either a comma or the end of the string follows:

p = re.compile(r'\((.*?)\)(?:,|$)')
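
For reference, here is a runnable sketch of that pattern on the string from the question. The second input, tricky, is a hypothetical string (not from the question) added to show one assumption the pattern relies on: a quoted value must not itself contain ")," since the match would end there.

```python
import re

# Lazy match that may only end at a ")" followed by a comma
# or by the end of the string.
p = re.compile(r'\((.*?)\)(?:,|$)')

S = ("(45171924,-1,'AbuseFilter/658',2600),"
     "(43795362,-1,'!!_(disambiguation)',2600),"
     "(45795362,-1,'!!_(disambiguation)',2699)")

rows = p.findall(S)
for row in rows:
    print(row)

# Hypothetical input: the quoted value 'a),(b' contains "),",
# so the pattern still splits inside the quotes here.
tricky = "(1,'a),(b',2),(3,'c',4)"
tricky_rows = p.findall(tricky)
print(tricky_rows)
```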


Upvotes: 3

Zlatin Zlatev

Reputation: 3089

A simple approach would be a negative lookahead: check that no quote follows the closing parenthesis, e.g.

import re
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

m = re.findall(r'\((.*?)\)(?![\'])', S)
for element in m:
    print(element)

prints

45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699

http://www.codeskulptor.org/#user39_CL89xhroV0_0.py

I have put the quote in a character class (square brackets) so that you can add other symbols that, when they directly follow a closing parenthesis, should cause it to be ignored.
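
As a sketch of how that character class could be extended, the helper below builds the same lookahead from a configurable set of characters. The name build_pattern, its skip_after parameter and the sample string are assumptions made for illustration only.

```python
import re


def build_pattern(skip_after="'"):
    """Build the lookahead regex; skip_after lists the characters that,
    directly after a ")", mean that ")" is not a closing one.
    skip_after must not be empty, or the character class is invalid."""
    return re.compile(r"\((.*?)\)(?![" + re.escape(skip_after) + r"])")


# Hypothetical input mixing double- and single-quoted values
sample = "(1,\"a(b)\",2),(3,'c(d)',4)"

# Ignore a ")" that is followed by either kind of quote
both_quotes = build_pattern("'\"")
result = both_quotes.findall(sample)
for element in result:
    print(element)
```

Like the original pattern, this only looks at the single character after each ")", so it assumes a parenthesis inside quotes is always immediately followed by the closing quote.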

Upvotes: 0
