I have a few strings, which I want to split on spaces and the characters ", ', (, ), ;, | and &, except when they are escaped with a \.
A few examples are as follows:
"hello-world" -> [r"hello-world"]
"hello;world " -> [r"hello", r"world"]
"he(llo)(w|o rld)" -> ["he", "llo", "w", "o", "rld"]
r"hello\;world" -> [r"hello\;world"]
r"hello\-world" -> [r"hello\-world"]
To do this, I wrote the regex:
r'''(?:[^\s"'();|&]+|\\.)+'''
It works for all the other cases, except one:
>>> re.findall(r'''(?:[^\s"'();|&]+|\\.)+''', r'hello\;world')
['hello\\', 'world']
How can I modify the regex to get the expected result?
I'd prefer not to use re.split(); the regex above is part of a much larger regex used for tokenizing a programming language using .findall().
Upvotes: 1
Views: 87
Reputation: 626929
The [^\s"'();|&]+ part of your pattern also matches the \, so by the time \\. is tried the backslash has already been consumed and the escaped char can no longer be matched as a unit.
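A minimal sketch that makes this visible, using only the standard re module: applying just the first alternative to the sample string shows that it swallows the backslash. The variable names here are illustrative and not part of the original pattern.

```python
import re

s = r'hello\;world'

# The first alternative on its own: the backslash is not excluded from the
# character class, so it is consumed together with "hello".
first_alt = re.match(r'''[^\s"'();|&]+''', s)
print(first_alt.group())  # hello\

# Matching would have to resume at ";", where neither alternative can match:
# ";" is excluded from the class, and \\. needs a backslash to start.
rest = s[first_alt.end():]
print(rest)  # ;world
```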
You may use
(?:\\.|[^\s"'();|&\\])+
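Here, the pattern matches 1 or more repetitions of any escaped char (if you use re.DOTALL or re.S, even including line break chars), or any char other than whitespace, ", ', (, ), ;, |, & or \. Because \ is excluded from the character class, a backslash can only be consumed by \\., together with the char it escapes.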
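import re

# Sample inputs from the question (raw strings keep the backslashes literal)
strs = ['hello-world', r'hello;world ', r'he(llo)(w|o rld)', r'hello\;world', r'hello\-world ']
for s in strs:
    # \\. consumes an escaped char; the class excludes the delimiters and the backslash
    res = re.findall(r'''(?:\\.|[^\s"'();|&\\])+''', s)
    for val in res:
        print(val)
    print("-------------")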
Output:
hello-world
-------------
hello
world
-------------
he
llo
w
o
rld
-------------
hello\;world
-------------
hello\-world
-------------
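The re.DOTALL remark above can be checked with a short sketch. It assumes an input in which a backslash is followed by an actual line break; the sample string and the PATTERN name are made up for illustration.

```python
import re

PATTERN = r'''(?:\\.|[^\s"'();|&\\])+'''

# A backslash followed by a real line break (an escaped newline).
s = 'foo\\\nbar'

# Default: "." does not match "\n", so \\. fails at the backslash, and the
# backslash cannot be matched by the character class either.
print(re.findall(PATTERN, s))             # ['foo', 'bar']

# With re.DOTALL, "." also matches "\n", so the escaped line break stays
# inside a single token.
print(re.findall(PATTERN, s, re.DOTALL))  # ['foo\\\nbar']
```

Whether an escaped line break should continue a token depends on the language being tokenized, so the flag is a design choice rather than a requirement.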
Upvotes: 1