MaM

Reputation: 265

re.split() with special cases

I am new to regular expressions and have a problem with the re.split function.

In my case the split has to respect "special escapes".

The text should be separated at ;, except where the ; is preceded by a ?.

Edit: In that case the two parts shouldn't be split and the ? has to be removed.

Here is an example and the result I want:

import re
txt = 'abc;vwx?;yz;123'
re.split(r'magical pattern', txt)
['abc', 'vwx;yz', '123']

So far I have tried this:

re.split(r'(?<!\?);', txt)

and got:

['abc', 'vwx?;yz', '123']

Unfortunately the ? is not consumed, and the following list comprehension is too slow for my performance-critical code:

[part.replace('?;', ';') for part in re.split(r'(?<!\?);', txt)]
['abc', 'vwx;yz', '123']

Is there a "fast" way to reproduce that behavior with re?

Could the re.findall function be the solution to take?

For example, an extended version of this code:

re.findall(r'[^;]+', txt)

I am using Python 2.7.3.

Thanking you in anticipation!

Upvotes: 4

Views: 338

Answers (4)

THM

Reputation: 805

Try this :-)

def split(txt, sep, esc, escape_chars):
    ''' Split a string
        txt - string to split
        sep - separator, one character
        esc - escape character
        escape_chars - List of characters allowed to be escaped
    '''
    l = []
    tmp = []
    i = 0
    while i < len(txt):
        if len(txt) > i + 1 and txt[i] == esc and txt[i+1] in escape_chars:
            i += 1
            tmp.append(txt[i])
        elif txt[i] == sep:
            l.append("".join(tmp))
            tmp = []
        elif txt[i] == esc:
            print('Escape Error')
        else:
            tmp.append(txt[i])
        i += 1
    l.append("".join(tmp))
    return l

if __name__ == "__main__":
    txt = 'abc;vwx?;yz;123'
    print(split(txt, ';', '?', [';', '\\', '?']))

Returns:

['abc', 'vwx;yz', '123']

Upvotes: 0

Julien Grenier

Reputation: 3394

I would do it like this:

re.sub(r'(?<!\?);', '|', txt).replace('?;', ';').split('|')

Note that this only works if | never occurs in the text itself.
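The one-liner depends on the | character never appearing in the input. A sketch of the same idea with a less likely placeholder and an explicit check might look like this; split_with_placeholder and its placeholder parameter are hypothetical names chosen for illustration, and the NUL character is only an assumption about what the data will not contain:

```python
import re

def split_with_placeholder(text, placeholder='\x00'):
    # Hypothetical helper: 'placeholder' must be a string that cannot
    # occur in 'text'; NUL is assumed to be safe here.
    if placeholder in text:
        raise ValueError('placeholder occurs in the text')
    # Mark every ';' that is not preceded by '?' with the placeholder.
    # A function is used as replacement so the placeholder is taken literally.
    marked = re.sub(r'(?<!\?);', lambda match: placeholder, text)
    # Unescape the remaining '?;' pairs, then split on the placeholder.
    return marked.replace('?;', ';').split(placeholder)

print(split_with_placeholder('abc;vwx?;yz;123'))
```

This still makes two passes over the string, so it is not expected to be faster than the list comprehension in the question.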

Upvotes: 0

Janne Karila

Reputation: 25197

Regex is not the tool for the job. Use the csv module instead:

>>> import csv
>>> txt = 'abc;vwx?;yz;123'
>>> r = csv.reader([txt], delimiter=';', escapechar='?')
>>> next(r)
['abc', 'vwx;yz', '123']
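For repeated use, the csv call can be wrapped in a small function. This is a minimal sketch: split_escaped is a hypothetical helper name, and passing quoting=csv.QUOTE_NONE is an added assumption (it is not in the snippet above) so that a field starting with a double quote is not treated as a quoted field.

```python
import csv

def split_escaped(text, sep=';', esc='?'):
    # Hypothetical helper, not part of the csv module.
    # QUOTE_NONE (an assumption about the data) keeps '"' an ordinary character.
    reader = csv.reader([text], delimiter=sep, escapechar=esc,
                        quoting=csv.QUOTE_NONE)
    # The reader yields one row per input line; there is exactly one here.
    return next(reader)

print(split_escaped('abc;vwx?;yz;123'))  # ['abc', 'vwx;yz', '123']
```

Note that the escape character applies to whatever character follows it, so ?? yields a literal ?, which differs from the lookbehind regex in the question.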

Upvotes: 5

Martijn Pieters

Reputation: 1122342

You cannot do what you want with one regular expression. Unescaping ?; after splitting is a separate task altogether, not one that you can get the re module to do for you while splitting at the same time.

Just keep the tasks separate; you could use a generator to do the unescaping for you:

import re

txt = 'abc;vwx?;yz;123'

def unescape(iterable):
    # Lazily turn each escaped '?;' back into a plain ';'.
    for item in iterable:
        yield item.replace('?;', ';')

for elem in unescape(re.split(r'(?<!\?);', txt)):
    print(elem)

but that won't be faster than your list comprehension.
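Since the question is about speed, it is probably worth measuring the candidates rather than guessing. A rough sketch with timeit follows; split_regex and split_csv are hypothetical wrapper names, the csv variant is the one from another answer in this thread, and the repetition count is arbitrary. Timings depend on the machine and on how long the real input strings are, so no numbers are shown.

```python
import csv
import re
import timeit

txt = 'abc;vwx?;yz;123'

def split_regex(text):
    # The approach from the question: split on unescaped ';', then unescape.
    return [part.replace('?;', ';') for part in re.split(r'(?<!\?);', text)]

def split_csv(text):
    # The csv-based alternative, with '?' as the escape character.
    return next(csv.reader([text], delimiter=';', escapechar='?'))

for func in (split_regex, split_csv):
    # Time 10000 calls of each candidate on the sample string.
    seconds = timeit.timeit(lambda: func(txt), number=10000)
    print('%s: %.4f s' % (func.__name__, seconds))
```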

Upvotes: 0
