Reputation: 729
Below is the string I am trying to split on commas.
If a comma appears inside () or {}, the string shouldn't be split at that comma.
I am splitting using the code below.
For now it only takes care of (). How do I include {} as well?
import re
s = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
print(re.split(r',\s*(?![^()]*\))', s))
Output should be:
[Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}]
Upvotes: 1
Views: 1832
Reputation: 42007
With the third-party regex module, which allows variable-length lookbehind:
regex.split(r'(?<![({][^,]*),(?![^,]*[})])', str_)
(?<![({][^,]*) is a zero-width negative lookbehind that makes sure the , is not preceded by ( or { with no , in between
, matches a literal ,
(?![^,]*[})]) is a zero-width negative lookahead that makes sure the , is not followed by ) or } with no , in between
Example:
In [1287]: str_ = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
In [1288]: regex.split(r'(?<![({][^,]*),(?![^,]*[})])', str_)
Out[1288]: ['Water', '^.*f04.*&~(.*z.,*)$', 'Iron Oxides (CI 77491, 77492)', 'a{3,4}']
Limitations:
[({] matches either ( or {, and [})] matches either ) or }, so this could lead to bugs when, e.g., the substring starts with ( and ends in }, or the other way around
Won't work for nested parentheses/brackets
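For comparison, here is a minimal sketch that stays within the standard re module by adding a second alternative for {} to the lookahead from the question. The name PATTERN and the nested input "f(a,(b),c),d" are made up for illustration. It shares the nesting limitation, which the second call demonstrates:

```python
import re

# Lookahead from the question, extended with an alternative for {}:
# do not split if the next bracket character ahead is a closing ) or }.
PATTERN = re.compile(r',\s*(?![^()]*\)|[^{}]*\})')

s = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
flat = PATTERN.split(s)
print(flat)
# ['Water', '^.*f04.*&~(.*z.,*)$', 'Iron Oxides (CI 77491, 77492)', 'a{3,4}']

# Hypothetical nested input: the comma after "a" is wrongly treated as a separator,
# because the next bracket character ahead of it is an opening (.
nested = PATTERN.split("f(a,(b),c),d")
print(nested)
# ['f(a', '(b),c)', 'd']
```

Note that the \s* after the comma also swallows any whitespace that follows a separator.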
Upvotes: 1
Reputation: 365707
Assuming your brackets can be nested, what you have isn't a regular language. While re does have a lot of extensions that let it handle things beyond actual regular expressions, it's probably better to just approach this with a trivial bracket-counting parser.
Something like this (untested, but it should be simple enough to understand and debug):
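bracketmap = {'(': ')', '[': ']', '{': '}'}

def splitify(s):
    stack = []      # closing brackets we still expect, innermost last
    lastcomma = 0   # index just past the previous top-level comma
    for i, c in enumerate(s):
        if not stack and c == ',':
            # comma outside all brackets: emit the piece before it
            yield s[lastcomma:i]
            lastcomma = i+1
        elif c in bracketmap:
            stack.append(bracketmap[c])
        elif c in ')]}':
            # a closer with nothing open, or the wrong closer, is an error
            if not stack or stack.pop() != c:
                raise ValueError('unbalanced brackets')
    if stack:
        raise ValueError('unbalanced brackets')
    # emit the final piece (empty if the string ends with a comma)
    yield s[lastcomma:]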
From a comment, when asked whether your brackets can be nested, you said:
it can be if it appears to be a valid regex.
So, if the string is actually meant to be a regex pattern, you need to do more than just exclude commas inside brackets. For example, the comma in \{,\} isn't inside counting braces; it's a perfectly normal literal comma.
Writing a full regex parser is obviously a bit more complicated than just counting bracket pairs (although if you want exactly the Python re syntax, you can just use that library to compile it and then use the library's debugging tools to scan for literal parens, instead of writing it yourself…), but maybe you can get away with just counting unescaped bracket pairs?
esc = False
for i, c in enumerate(s):
    if esc:
        # previous character was a backslash: take this one literally
        esc = False
    elif c == '\\':
        esc = True
    elif not stack and c == ',':
        # same as before
(I'm assuming here that you don't want to treat \, as a literal comma. If you do, that's a trivial change.)
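Putting the two pieces together, a complete version of the escape-aware bracket counter might look like the sketch below. The names split_top_level, BRACKETS and CLOSERS are made up for this example, and the second input is a hypothetical string chosen to exercise escaped and nested braces:

```python
BRACKETS = {'(': ')', '[': ']', '{': '}'}
CLOSERS = set(BRACKETS.values())

def split_top_level(s):
    """Yield the pieces of s separated by commas outside any unescaped brackets."""
    stack = []    # closing brackets we still expect, innermost last
    start = 0     # index where the current piece begins
    esc = False   # True if the previous character was an unescaped backslash
    for i, c in enumerate(s):
        if esc:
            esc = False              # escaped character: never a bracket or separator
        elif c == '\\':
            esc = True
        elif c == ',' and not stack:
            yield s[start:i]
            start = i + 1
        elif c in BRACKETS:
            stack.append(BRACKETS[c])
        elif c in CLOSERS:
            if not stack or stack.pop() != c:
                raise ValueError('unbalanced brackets')
    if stack:
        raise ValueError('unbalanced brackets')
    yield s[start:]

s = "Water,^.*f04.*&~(.*z.,*)$,Iron Oxides (CI 77491, 77492),a{3,4}"
print(list(split_top_level(s)))
# ['Water', '^.*f04.*&~(.*z.,*)$', 'Iron Oxides (CI 77491, 77492)', 'a{3,4}']

# Escaped braces do not open a group; nested braces do.
print(list(split_top_level(r"a\{,b\},c{1,{2,3}}")))
# ['a\\{', 'b\\}', 'c{1,{2,3}}']
```

This still isn't a regex parser: for instance, a bracket inside a character class such as [(] is literal in a regex, but this sketch would count it and raise ValueError.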
Upvotes: 1