Splitting string on characters, except when escaped

Question

I have a few strings, which I want to split on spaces and the characters ", ', (, ), ;, | and &, except when they are escaped with a \.

A few examples are as follows:

"hello-world" -> [r"hello-world"]
"hello;world " -> [r"hello", r"world"]
"he(llo)(w|o rld)" -> ["he", "llo", "w", "o", "rld"]
r"hello\;world" -> [r"hello\;world"]
r"hello\-world" -> [r"hello\-world"]

To do this, I wrote the regex:

r'''(?:[^\s"'();|&]+|\\.)+'''

It works for all the other cases, except one:

>>> re.findall(r'''(?:[^\s"'();|&]+|\\.)+''', r'hello\;world')
['hello\\', 'world']

How can I modify the regex to get the expected result?

I'd prefer not to use re.split(); the regex above is part of a much larger regex used for tokenizing a programming language using .findall().

Wiktor Stribiżew · Accepted Answer

The [^\s"'();|&]+ part of your pattern also consumes the \, so \\. never gets a chance to match the escaped character.
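A minimal sketch of that failure, using the pattern and input from the question (the variable names are only for illustration):

```python
import re

# The pattern from the question: the character class does not exclude "\".
original = re.compile(r'''(?:[^\s"'();|&]+|\\.)+''')

text = r'hello\;world'

# The class greedily consumes "hello\" and stops at ";".
# At ";" neither alternative matches, so the first token ends there.
first = original.match(text).group()
print(first)                   # hello\
print(original.findall(text))  # ['hello\\', 'world']
```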

You may use

(?:\\.|[^\s"'();|&\\])+

See the regex demo

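Here, the pattern matches one or more repetitions of either an escaped character (a \ followed by any character; with re.DOTALL or re.S this includes line break characters) or any character other than whitespace, ", ', (, ), ;, |, & or \. Because \ is now excluded from the character class, a backslash can only be consumed together with the character that follows it.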
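The effect of re.DOTALL on the escaped-character branch can be seen in a small sketch. The input string here is a made-up example, not one from the question:

```python
import re

pattern = r'''(?:\\.|[^\s"'();|&\\])+'''

# Hypothetical input: "a", a backslash, a newline, then "b".
text = 'a\\\nb'

# Without DOTALL, "." does not match the newline, so the backslash is skipped.
print(re.findall(pattern, text))             # ['a', 'b']

# With DOTALL, the backslash and the newline are consumed as one escaped pair.
print(re.findall(pattern, text, re.DOTALL))  # ['a\\\nb']
```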

Python demo:

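import re

# Sample inputs from the question (raw strings keep the backslashes literal).
strs = ['hello-world', r'hello;world ', r'he(llo)(w|o rld)', r'hello\;world', r'hello\-world ']
for s in strs:
    # Try the escaped-character branch first, then any non-delimiter character.
    res = re.findall(r'''(?:\\.|[^\s"'();|&\\])+''', s)
    for val in res:
        print(val)
    print("-------------")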

Output:

hello-world
-------------
hello
world
-------------
he
llo
w
o
rld
-------------
hello\;world
-------------
hello\-world
-------------
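Since the question mentions that this fragment is part of a larger tokenizing regex used with .findall(), here is a rough sketch of how it might be embedded. The names WORD, TOKEN and tokenize, and the operator alternative, are assumptions made for illustration and do not come from the question:

```python
import re

# Fragment from the answer: an escaped character, or any non-delimiter character.
WORD = r'''(?:\\.|[^\s"'();|&\\])+'''

# Hypothetical larger tokenizer: the operator alternative is illustrative only
# and is not taken from the question.
TOKEN = re.compile(WORD + r'''|[();|&]''')

def tokenize(source):
    """Return the words and single-character operators found in source."""
    return TOKEN.findall(source)

print(tokenize(r'he(llo) a\;b'))  # ['he', '(', 'llo', ')', 'a\\;b']
```

Note that a lone backslash at the very end of the input matches neither branch, so .findall() skips it silently; a real tokenizer would probably want to report that case as an error.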
