Reputation: 1033
I have a regex which works perfectly in Python 2:
parts = re.split(r'\s*', re.sub(r'^\s+|\s*$', '', expression)) # split expression into 5 parts
This regex splits an expression into 5 parts. For example,
'a * b = c' will be split into ['a', '*', 'b', '=', 'c'],
'11 + 12 = 23' will be split into ['11', '+', '12', '=', '23'],
'ab - c = d' will be split into ['ab', '-', 'c', '=', 'd'],
etc.
But in Python 3 the same regex behaves quite differently:
'a * b = c' will be split into ['', 'a','', '*', '', 'b','', '=', '', 'c', ''],
'11 + 12 = 23' will be split into ['', '1', '1', '', '+', '', '1', '2', '', '=', '', '2', '3', ''],
'ab - c = d' will be split into ['', 'a', 'b', '', '-', '', 'c', '', '=', '', 'd', ''],
In general, in Python 3, each character in a part is split off into a separate part, and each removed space (including the nonexistent leading and trailing ones) becomes an empty part ('') in the list.
This Python 3 behavior differs greatly from Python 2. Could anyone tell me why Python 3 changed this so much, and what the correct regex is to split an expression into 5 parts as in Python 2?
Upvotes: 4
Views: 3470
Reputation: 338178
The ability to split on zero-length matches was added to re.split() in Python 3.7. When you change your split pattern to \s+ instead of \s*, the behavior will be as expected in 3.7+ (and unchanged in Python < 3.7):
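import re

def parts(string):
    # Trim leading/trailing whitespace, then split on runs of one or more whitespace characters
    return re.split(r'\s+', re.sub(r'^\s+|\s*$', '', string))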
test:
>>> print(parts('a * b = c'))
['a', '*', 'b', '=', 'c']
>>> print(parts('ab - c = d'))
['ab', '-', 'c', '=', 'd']
>>> print(parts('a * b = c'))
['a', '*', 'b', '=', 'c']
>>> print(parts('11 + 12 = 23'))
['11', '+', '12', '=', '23']
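To see why \s* causes the extra splits, it may help to list where each pattern actually matches. The following is only a rough sketch: the variable names and the split_expression helper are made up for illustration, and the exact set of empty matches reported for \s* can differ between Python versions. It also shows str.split() with no arguments, which splits on runs of whitespace and ignores leading and trailing whitespace, so it could serve as a regex-free alternative for this particular input format.

```python
import re

expression = 'ab - c = d'

# Every span where r'\s*' matches. Spans with start == end are empty matches,
# which re.split() treats as split points from Python 3.7 onward.
star_spans = [m.span() for m in re.finditer(r'\s*', expression)]
empty_spans = [s for s in star_spans if s[0] == s[1]]

# r'\s+' requires at least one whitespace character, so it can never match empty.
plus_spans = [m.span() for m in re.finditer(r'\s+', expression)]

def split_expression(text):
    # Hypothetical helper: str.split() without arguments splits on whitespace runs
    # and drops leading/trailing whitespace, so no regex is needed here.
    return text.split()

print(empty_spans)
print(plus_spans)
print(split_expression('  11 + 12 = 23  '))
```

Because \s+ only matches the four real gaps between the parts, splitting on it yields exactly five parts regardless of Python version.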
The regex module, a drop-in replacement for re, has a "V1" mode that makes existing patterns behave like they did before Python 3.7 (see this answer).
Upvotes: 5