Reputation: 265
I am new to regular expressions and have a problem with re.split.
In my case the split has to respect "special escapes".
The text should be separated at ; except where the ; is preceded by a ?.
Edit: In that case the two parts should not be split, and the ? has to be removed.
Here is an example and the result I want:
import re
txt = 'abc;vwx?;yz;123'
re.split(r'magical pattern', txt)
['abc', 'vwx;yz', '123']
So far I have tried this:
re.split(r'(?<!\?);', txt)
and got:
['abc', 'vwx?;yz', '123']
Sadly, the unconsumed ? causes trouble, and the following list comprehension is too slow for my performance-critical code:
[part.replace('?;', ';') for part in re.split(r'(?<!\?);', txt)]
['abc', 'vwx;yz', '123']
Is there a "fast" way to reproduce that behavior with re?
Could the re.findall function be the solution to take?
For example, an extended version of this code:
re.findall(r'[^;]+', txt)
I am using Python 2.7.3.
Thanks in advance!
Upvotes: 4
Views: 338
Reputation: 805
Try this :-)
def split(txt, sep, esc, escape_chars):
    ''' Split a string
    txt - string to split
    sep - separator, one character
    esc - escape character
    escape_chars - List of characters allowed to be escaped
    '''
    l = []
    tmp = []
    i = 0
    while i < len(txt):
        if len(txt) > i + 1 and txt[i] == esc and txt[i+1] in escape_chars:
            # Escape sequence: skip the escape character, keep the next one
            i += 1
            tmp.append(txt[i])
        elif txt[i] == sep:
            # Unescaped separator: close the current part
            l.append("".join(tmp))
            tmp = []
        elif txt[i] == esc:
            # Escape character followed by something that may not be escaped
            print('Escape Error')
        else:
            tmp.append(txt[i])
        i += 1
    l.append("".join(tmp))
    return l

if __name__ == "__main__":
    txt = 'abc;vwx?;yz;123'
    print(split(txt, ';', '?', [';','\\','?']))
Returns:
['abc', 'vwx;yz', '123']
Upvotes: 0
Reputation: 3394
I would do it like this:
re.sub(r'(?<!\?);', '|', txt).replace('?;', ';').split('|')
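Note that this one-liner only works if | never occurs in the text itself. A minimal sketch of the same idea with a less likely placeholder follows; the names SENTINEL and split_escaped are made up for illustration, and it assumes the input never contains a NUL character:

```python
import re

# Assumption: a NUL character never occurs in the text being split.
SENTINEL = '\x00'

def split_escaped(txt):
    # Mark every ';' that is not preceded by '?' with the sentinel,
    # unescape the remaining '?;' pairs, then split on the sentinel.
    marked = re.sub(r'(?<!\?);', SENTINEL, txt)
    return marked.replace('?;', ';').split(SENTINEL)

print(split_escaped('abc;vwx?;yz;123'))
```

Like the original lookbehind pattern, this does not treat ?? as an escaped question mark.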
Upvotes: 0
Reputation: 25197
Regex is not the tool for the job. Use the csv module instead:
>>> import csv
>>> txt = 'abc;vwx?;yz;123'
>>> r = csv.reader([txt], delimiter=';', escapechar='?')
>>> next(r)
['abc', 'vwx;yz', '123']
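If the text may contain double quotes, the reader's default quoting rules could interfere. A possible sketch that wraps the call in a helper and switches quote handling off; the function name split_escaped is hypothetical, and csv.QUOTE_NONE is an addition that the snippet above does not use:

```python
import csv

def split_escaped(txt, delimiter=';', escapechar='?'):
    # QUOTE_NONE makes the reader treat '"' as an ordinary character,
    # so only the delimiter and the escape character are special.
    reader = csv.reader([txt], delimiter=delimiter,
                        escapechar=escapechar, quoting=csv.QUOTE_NONE)
    return next(reader)

print(split_escaped('abc;vwx?;yz;123'))
```

The reader is built for a single line here; for many lines it is probably cheaper to pass them all to one csv.reader and iterate over it.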
Upvotes: 5
Reputation: 1122342
You cannot do what you want with one regular expression. Unescaping ?; after splitting is a separate task altogether, not one that you can get the re module to do for you while splitting at the same time.
Just keep the task separate; you could use a generator to do the unescaping for you:
def unescape(iterable):
    for item in iterable:
        yield item.replace('?;', ';')

for elem in unescape(re.split(r'(?<!\?);', txt)):
    print(elem)
but that won't be faster than your list comprehension.
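To check the speed claim on your own data, a rough timeit sketch could look like this. The wrapper function names and the repetition count are arbitrary choices for illustration, and the numbers will depend on the machine and the input:

```python
import re
import timeit

txt = 'abc;vwx?;yz;123'
pattern = re.compile(r'(?<!\?);')  # precompiled split pattern

def with_list_comprehension():
    return [part.replace('?;', ';') for part in pattern.split(txt)]

def unescape(iterable):
    for item in iterable:
        yield item.replace('?;', ';')

def with_generator():
    # list() forces the generator so both variants do the same work
    return list(unescape(pattern.split(txt)))

if __name__ == '__main__':
    for func in (with_list_comprehension, with_generator):
        seconds = timeit.timeit(func, number=100000)
        print('%s: %.3f s' % (func.__name__, seconds))
```

Both variants return the same list, so the comparison only measures the overhead of each approach.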
Upvotes: 0