user2064000

Splitting string on characters, except when escaped

I have a few strings that I want to split on whitespace and on the characters ", ', (, ), ;, | and &, except when those characters are escaped with a \.

A few examples are as follows:

"hello-world" -> [r"hello-world"]
"hello;world " -> [r"hello", r"world"]
"he(llo)(w|o rld)" -> ["he", "llo", "w", "o", "rld"]
r"hello\;world" -> [r"hello\;world"]
r"hello\-world" -> [r"hello\-world"]

To do this, I wrote the regex:

r'''(?:[^\s"'();|&]+|\\.)+'''

It works for every case except one:

>>> re.findall(r'''(?:[^\s"'();|&]+|\\.)+''', r'hello\;world')
['hello\\', 'world']

How can I modify the regex to get the expected result?

I'd prefer not to use re.split(); the regex above is part of a much larger regex used for tokenizing a programming language using .findall().

Upvotes: 1

Views: 87

Answers (1)

Wiktor Stribiżew

Reputation: 626929

The [^\s"'();|&]+ part of your pattern also matches \, so it consumes the backslash before the \\. alternative gets a chance to match the escaped character.
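To see this concretely, here is a minimal check (the variable names are mine, not from the question) showing that the character class on its own accepts a backslash, which is why the first token ends in \:

```python
import re

# Character class from the question's pattern; note that \ is not excluded.
char_class = r'''[^\s"'();|&]'''
question_pattern = r'''(?:[^\s"'();|&]+|\\.)+'''

# The class alone matches a single backslash.
class_matches_backslash = re.fullmatch(char_class, '\\') is not None

# So the first alternative swallows "hello\" and the match stops at ";".
tokens = re.findall(question_pattern, r'hello\;world')

print(class_matches_backslash)
print(tokens)
```

The regex engine never backtracks to try \\. here, because the overall match has already succeeded with the shorter token.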

You may use

(?:\\.|[^\s"'();|&\\])+

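Here, the pattern matches one or more repetitions of either an escaped char (a backslash followed by any char; with re.DOTALL or re.S, that includes line break chars) or any single char other than whitespace, ", ', (, ), ;, |, & or \. Because \ is excluded from the character class, a backslash can only be consumed together with the char it escapes.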
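As a small illustration of the re.DOTALL remark, the sketch below runs the same pattern on a string containing a backslash followed by a newline. The sample string is my own, chosen only to show the difference:

```python
import re

pattern = r'''(?:\\.|[^\s"'();|&\\])+'''

# "a", a backslash, a newline, then "b"
s = 'a\\\nb'

# Without DOTALL, "." does not match the newline, so the backslash
# cannot be consumed and the string splits into two tokens.
without_dotall = re.findall(pattern, s)

# With DOTALL, backslash + newline counts as one escaped char.
with_dotall = re.findall(pattern, s, flags=re.DOTALL)

print(without_dotall)
print(with_dotall)
```

Whether an escaped line break should stay inside a token depends on the language being tokenized, so treat the flag as a choice rather than a requirement.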

Python demo:

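import re

# An escaped char (\ plus any char) or any char that is not a delimiter or a backslash
pattern = re.compile(r'''(?:\\.|[^\s"'();|&\\])+''')
strs = ['hello-world', r'hello;world ', r'he(llo)(w|o rld)', r'hello\;world', r'hello\-world ']
for s in strs:
    # findall returns every non-overlapping token in s
    for val in pattern.findall(s):
        print(val)
    print("-------------")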

Output:

hello-world
-------------
hello
world
-------------
he
llo
w
o
rld
-------------
hello\;world
-------------
hello\-world
-------------
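Since the question mentions that this regex is part of a larger .findall()-based tokenizer, here is a rough sketch of how the fixed pattern might sit inside one. The operator alternative and the sample input are my own assumptions, not taken from the actual tokenizer:

```python
import re

# Hypothetical combined token pattern: a "word" (the fixed pattern)
# or a single operator char. Both groups are non-capturing so that
# findall returns the whole match rather than group contents.
token_re = re.compile(r'''(?:\\.|[^\s"'();|&\\])+|[();|&]''')

# Assumed sample input: the escaped ";" stays inside its word,
# while the unescaped ";" and "|" come out as separate tokens.
tokens = token_re.findall(r'echo a\;b;ls|wc')

print(tokens)
```

If the larger regex contains capturing groups, re.findall returns the group contents instead of the full match, so re.finditer with m.group(0) may be easier to work with.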

Upvotes: 1
