zubug55

Reputation: 729

Split string on comma not present in round brackets or curly brackets in python

Below is the string I am trying to split on comma.

If a comma appears inside () or {}, the string shouldn't be split at that comma.

I am splitting with the code below. For now it only takes care of (); how do I include {} as well?

import re
s = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
print(re.split(r',\s*(?![^()]*\))', s))

Output should be:

['Water', '^.*f04.*&~(.*z.,*)$', 'Iron Oxides (CI 77491, 77492)', 'a{3,4}']

Upvotes: 1

Views: 1832

Answers (2)

heemayl

Reputation: 42007

With the third-party regex module, which allows variable-length lookbehind:

import regex
regex.split(r'(?<![({][^,]*),(?![^,]*[})])', str_)

  • (?<![({][^,]*) is a zero-width negative lookbehind that makes sure the , is not preceded by a ( or { with no other , in between

  • , matches a literal ,

  • (?![^,]*[})]) is a zero-width negative lookahead that makes sure the , is not followed by a ) or } with no other , in between

Example:

In [1287]: str_ = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"

In [1288]: regex.split(r'(?<![({][^,]*),(?![^,]*[})])', str_)
Out[1288]: ['Water', '^.*f04.*&~(.*z.,*)$', 'Iron Oxides (CI 77491, 77492)', 'a{3,4}']

Limitations:

  • [({] matches any of (/{, and [})] matches any of )/}, so this could lead to bugs when e.g. the substring starts with ( and ends in } or the other way around

  • Won't work for nested parentheses/brackets
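
If the third-party regex module is not available, a similar non-nested split can be sketched with the standard re module by chaining a second negative lookahead onto the pattern from the question, one lookahead per bracket type. This is a different pattern from the one above, and the name PATTERN is only illustrative. It assumes brackets are balanced and never nested; the last call shows what goes wrong otherwise.

```python
import re

# Assumption: brackets are balanced and never nested.
# First lookahead: the comma is not followed by ')' before any other paren.
# Second lookahead: the same idea for '{' and '}'.
PATTERN = re.compile(r',\s*(?![^()]*\))(?![^{}]*\})')

s = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
parts = PATTERN.split(s)
print(parts)

# Nested brackets defeat the lookaheads: the comma after 'b' is split on
# even though it sits inside the outer parentheses.
nested = PATTERN.split("a,(b,(c,d),e),f")
print(nested)
```

On the string from the question this gives the four expected pieces; on the nested input it splits inside the outer parentheses, which is the same limitation listed above.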

Upvotes: 1

abarnert

Reputation: 365707

Assuming your brackets can be nested, what you have isn't a regular language. While re does have a lot of extensions that let it handle things beyond actual regular expressions, it's probably better to just approach this with a trivial bracket-counting parser.

Something like this (untested, but it should be simple enough to understand and debug):

bracketmap = {'(': ')', '[': ']', '{': '}'}

def splitify(s):
    stack = []
    lastcomma = 0
    for i, c in enumerate(s):
        if not stack and c == ',':
            yield s[lastcomma:i]
            lastcomma = i+1
        elif c in bracketmap:
            stack.append(bracketmap[c])
        elif c in ')]}':
            # a closer with nothing open, or the wrong closer, is unbalanced
            if not stack or stack.pop() != c:
                raise ValueError('unbalanced brackets')
    if stack:
        raise ValueError('unbalanced brackets')
    if lastcomma <= len(s):
        yield s[lastcomma:]

From a comment, when asked whether your brackets can be nested, you said:

it can be if it appears to be a valid regex.

So, if the string is actually meant to be a regex pattern, you need to do more than just exclude commas inside brackets. For example, the comma in \{,\} isn't inside counting braces; it's a perfectly normal literal comma.

Writing a full regex parser is obviously a bit more complicated than just counting bracket pairs (although if you want exactly the Python re syntax, you can just use that library to compile it and then use the library's debugging tools to scan for literal parens, instead of writing it yourself…), but maybe you can get away with just counting unescaped bracket pairs?

    esc = False
    for i, c in enumerate(s):
        if esc:
            esc = False
        elif c == '\\':
            esc = True
        elif not stack and c == ',':
            # same as before

(I'm assuming here that you don't want to treat \, as a literal comma. If you do, that's a trivial change.)
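
Putting the two fragments together, a complete escape-aware splitter might look like the sketch below. The name split_outside_brackets and the sample strings are illustrative. It assumes, as the fragment does, that a backslash escapes exactly one following character, so an escaped bracket is not counted and an escaped comma is not split on. It also guards against a closing bracket with nothing open. It is not a regex parser: a character class that contains a bracket, such as [(], would be reported as unbalanced.

```python
BRACKETS = {'(': ')', '[': ']', '{': '}'}
CLOSERS = set(BRACKETS.values())

def split_outside_brackets(s):
    """Yield the pieces of s separated by commas that are outside all brackets."""
    stack = []      # closers still expected, innermost last
    start = 0       # index where the current piece begins
    esc = False     # True when the previous character was an unescaped backslash
    for i, c in enumerate(s):
        if esc:
            esc = False             # escaped character: never a bracket or separator
        elif c == '\\':
            esc = True
        elif c == ',' and not stack:
            yield s[start:i]
            start = i + 1
        elif c in BRACKETS:
            stack.append(BRACKETS[c])
        elif c in CLOSERS:
            if not stack or stack.pop() != c:
                raise ValueError('unbalanced brackets')
    if stack:
        raise ValueError('unbalanced brackets')
    yield s[start:]

s = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
print(list(split_outside_brackets(s)))
print(list(split_outside_brackets("a,(b,(c,d),e),f")))   # nested brackets stay together
print(list(split_outside_brackets(r"x\{,\}y")))           # escaped braces do not protect the comma
```

Because it is a generator, wrap the call in list() when you need all pieces at once; a ValueError is only raised once iteration reaches the offending character.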

Upvotes: 1
