bard

Reputation: 3052

How to split by commas that are not within parentheses?

Say I have a string like this, where items are separated by commas, but an item may also contain commas inside parenthesized content:

(EDIT: Sorry, forgot to mention that some items may not have parenthesized content)

"Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

How can I split the string only at the commas that are NOT within parentheses? That is:

["Water", "Titanium Dioxide (CI 77897)", "Black 2 (CI 77266)", "Iron Oxides (CI 77491, 77492, 77499)", "Ultramarines (CI 77007)"]

I think I'd have to use a regex, perhaps something like this:

([(]?)(.*?)([)]?)(,|$)

but I'm still trying to make it work.

Upvotes: 23

Views: 14182

Answers (7)

Georgi Peev

Reputation: 11

Here are two shorter (more elegant?) versions that will deal with nested parentheses.

A generator:

def split(s, sep=","):
    i = d = 0
    for j in range(len(s)):
        d += {"(": 1, ")": -1}.get(s[j], 0)
        if s[j] == sep and d == 0:
            yield s[i:j]
            i = j + 1
    yield s[i:]
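
A quick usage sketch for the generator above, run on a shortened version of the string from the question. The function is repeated here only so the snippet runs on its own; the name split_outside_parens and the .strip() step are my additions, since the generator keeps the space that follows each comma.

```python
def split_outside_parens(s, sep=","):
    """Yield the pieces of s separated by sep at parenthesis depth 0."""
    start = depth = 0
    for pos, ch in enumerate(s):
        # Track nesting depth: +1 on "(", -1 on ")", unchanged otherwise.
        depth += {"(": 1, ")": -1}.get(ch, 0)
        if ch == sep and depth == 0:
            yield s[start:pos]
            start = pos + 1
    yield s[start:]


s = "Water, Titanium Dioxide (CI 77897), Iron Oxides (CI 77491, 77492, 77499)"
items = [piece.strip() for piece in split_outside_parens(s)]
print(items)
# ['Water', 'Titanium Dioxide (CI 77897)', 'Iron Oxides (CI 77491, 77492, 77499)']

# Nested parentheses are kept together as well.
nested = [piece.strip() for piece in split_outside_parens("a, b(c, d(e, f)), g")]
print(nested)
# ['a', 'b(c, d(e, f))', 'g']
```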

And a more functional style one (the initial= argument of accumulate needs Python 3.8+):

from itertools import accumulate, chain, starmap

def split(s, sep=","):
    b = accumulate(s, lambda br, ch: br + {"(": 1, ")": -1}.get(ch, 0), initial=0)
    c = (ch != sep for ch in s)
    st = [i for i, x in enumerate(chain([0], starmap(int.__or__, zip(b, c)), [0])) if x == 0]
    return [s[st[i]:st[i + 1] - 1] for i in range(len(st) - 1)]

If you don't mind more_itertools, you can import locate from it and make the line that builds st slightly more readable:

st = list(locate(chain([0], starmap(int.__or__, zip(b, c)), [0]), (0).__eq__))

Upvotes: 0

nerdfever.com

Reputation: 1782

This version seems to work with nested parentheses, brackets ([] or <>), and braces:

def split_top(string, splitter, openers="([{<", closers=")]}>", whitespace=" \n\t"):
    ''' Splits strings at occurrence of 'splitter' but only if not enclosed by brackets.
        Removes all whitespace immediately after each splitter.
        This assumes brackets, braces, and parens are properly matched - may fail otherwise '''

    outlist = []
    outstring = []

    depth = 0

    for c in string:
        if c in openers:
            depth += 1
        elif c in closers:
            depth -= 1

            if depth < 0:
                raise SyntaxError()

        if not depth and c == splitter:
            outlist.append("".join(outstring))
            outstring = []
        else:
            if len(outstring):
                outstring.append(c)
            elif c not in whitespace:
                outstring.append(c)

    outlist.append("".join(outstring))

    return outlist

Use it like this:

s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

split = split_top(s, ",") # splits on commas

It's probably not the fastest thing ever, I know.

Upvotes: 0

Marcin

Reputation: 4289

I believe I have a simpler regexp for this:

import re

rx_comma = re.compile(r",(?![^(]*\))")
result = rx_comma.split(string_to_split)

Explanation of the regexp:

  • Match , that:
  • Is NOT followed by:
    • A list of characters ending with ), where:
    • A list of characters between , and ) does not contain (

It will not work with nested parentheses, like a,b(c,d(e,f)). If you need that, a possible solution is to go through the result of the split and, whenever a piece has an opening parenthesis without its closing one, merge it with the pieces that follow :), like:

"a"
"b(c"      <- unbalanced (more "(" than ")"), merge with the next piece
"d(e,f))"  <- closes the open parentheses, giving "b(c,d(e,f))"
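
The merge step described above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the name split_and_merge is made up, and it assumes the parentheses in the input are balanced. Pieces are glued back together (with the comma the split removed) until the number of ")" catches up with the number of "(".

```python
import re

RX_COMMA = re.compile(r",(?![^(]*\))")


def split_and_merge(text):
    """Split with the regex, then re-join pieces until parentheses balance."""
    merged = []
    pending = None  # piece that is still waiting for its closing parentheses
    for piece in RX_COMMA.split(text):
        # Restore the comma removed by the split when continuing a piece.
        pending = piece if pending is None else pending + "," + piece
        if pending.count("(") <= pending.count(")"):
            merged.append(pending)
            pending = None
    if pending is not None:
        merged.append(pending)  # unbalanced input: keep the remainder as-is
    return merged


print(split_and_merge("a,b(c,d(e,f))"))
# ['a', 'b(c,d(e,f))']
```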

Upvotes: 2

Vishnu Upadhyay

Reputation: 5061

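You can do this with just str.replace and str.split: replace each ")," with ")" followed by a placeholder (here //), then split on the placeholder. Any placeholder works as long as it does not occur in the data.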

a = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
a = a.replace('),', ')//').split('//')
print(a)

Output:

['Titanium Dioxide (CI 77897)', ' Black 2 (CI 77266)', ' Iron Oxides (CI 77491, 77492, 77499)', ' Ultramarines (CI 77007)']
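
One caveat worth checking against the question's edit: this approach only splits after a closing parenthesis, so an item without parenthesized content (such as "Water") stays attached to the item after it. A small sketch with a shortened input of my own:

```python
a = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266)"
parts = a.replace("),", ")//").split("//")
print(parts)
# ['Water, Titanium Dioxide (CI 77897)', ' Black 2 (CI 77266)']
```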

Upvotes: 1

asimoneau

Reputation: 692

Using regex, this can be done easily with the findall function.

import re
s = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
re.findall(r"\w.*?\(.*?\)", s) # returns what you want

Use http://www.regexr.com/ if you want to understand regex better. The Python documentation for the re module is here: https://docs.python.org/2/library/re.html

EDIT: I modified the regex to also accept items without parentheses: \w[^,(]*(?:\(.*?\))?
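
For reference, here is a short sketch of the edited pattern applied to the full string from the question, including the "Water" item that has no parentheses. The variable names are mine.

```python
import re

s = ("Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), "
     "Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)")

# \w          start each item at a word character (skips ", " between items)
# [^,(]*      take everything up to the next comma or opening parenthesis
# (?:\(.*?\))? optionally take one parenthesized group, commas included
items = re.findall(r"\w[^,(]*(?:\(.*?\))?", s)
print(items)
# ['Water', 'Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)',
#  'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
```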

Upvotes: -1

Avinash Raj

Reputation: 174706

Use a negative lookahead to match only the commas that are not inside parentheses. Splitting the input string at the matched commas will give you the desired output.

,\s*(?![^()]*\))

DEMO

>>> import re
>>> s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.split(r',\s*(?![^()]*\))', s)
['Water', 'Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)', 'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
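
As a rough sketch of how this could be wrapped for reuse (the names TOP_LEVEL_COMMA and split_items are hypothetical), the pattern can be compiled once. The second call illustrates a limit of the lookahead: it only checks whether a ")" follows before any other parenthesis, so it handles one level of parentheses and splits too eagerly on nested input.

```python
import re

# A comma (plus trailing whitespace) that is NOT followed by a ")" with no
# other parenthesis in between, i.e. a comma outside a flat (...) group.
TOP_LEVEL_COMMA = re.compile(r",\s*(?![^()]*\))")


def split_items(text):
    return TOP_LEVEL_COMMA.split(text)


flat = split_items("Water, Iron Oxides (CI 77491, 77492), Ultramarines (CI 77007)")
print(flat)
# ['Water', 'Iron Oxides (CI 77491, 77492)', 'Ultramarines (CI 77007)']

nested = split_items("a, b(c, d(e, f))")
print(nested)
# ['a', 'b(c', 'd(e, f))']  <- the comma after "c" is split even though it is inside parentheses
```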

Upvotes: 45

nu11p01n73R

Reputation: 26667

Try the regex

[^()]*\([^()]*\),?

code:

>>> import re
>>> x = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.findall(r"[^()]*\([^()]*\),?", x)
['Titanium Dioxide (CI 77897),', ' Black 2 (CI 77266),', ' Iron Oxides (CI 77491, 77492, 77499),', ' Ultramarines (CI 77007)']
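
The matches above still carry the trailing comma and the leading space. A minimal cleanup sketch, assuming that stripping both is what is wanted (the variable names raw and items are mine):

```python
import re

x = ("Titanium Dioxide (CI 77897), Black 2 (CI 77266), "
     "Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)")

raw = re.findall(r"[^()]*\([^()]*\),?", x)
# Drop the trailing comma first, then the surrounding whitespace.
items = [m.rstrip(",").strip() for m in raw]
print(items)
# ['Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)',
#  'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
```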

See how the regex works: http://regex101.com/r/pS9oV3/1

Upvotes: -1
