Reputation: 787
I have a simple function for tokenizing words.
import re
def tokenize(string):
    return re.split("(\W+)(?<!')", string, re.UNICODE)
In python 2.7 it behaves like this:
In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']
In python 3.5.0 I get this:
In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
The problem is that 'é' should not be treated as a character to split on. I thought that re.UNICODE
would be enough to make \W
work the way I mean?
How can I get the same behaviour as Python 3.x in Python 2.x?
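For context, a quick interactive check (assuming a UTF-8 source/terminal, Python 2.7) shows that the plain literal is a byte string, so 'é' is really the two bytes \xc3\xa9 here, which is where the output above comes from:
>>> 'perché.'      # byte string: é is two UTF-8 bytes
'perch\xc3\xa9.'
>>> u'perché.'     # unicode string: é is a single code point
u'perch\xe9.'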
Upvotes: 1
Views: 402
Reputation: 177610
You'll want to use Unicode strings, but also note that the third parameter of split is not flags but maxsplit:
>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
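As a quick illustration (Python 2.7 session, where the flag is just an integer), re.UNICODE has the value 32, so the original call was effectively asking for at most 32 splits and never enabled Unicode matching:
>>> import re
>>> re.UNICODE
32
>>> # i.e. re.split("(\W+)(?<!')", string, re.UNICODE) behaves like
>>> # re.split("(\W+)(?<!')", string, maxsplit=32) with the default flags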
Example:
#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)
print(tokenize(u'perché.'))
Output:
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']
C:\>py -3 test.py
['perché', '.', '']
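If you prefer, a variant sketch that sidesteps the positional-argument pitfall is to pre-compile the pattern with the flag (TOKEN_RE is just an illustrative name):
#!coding:utf8
from __future__ import print_function
import re

# Compiling once keeps the UNICODE flag attached to the pattern object.
TOKEN_RE = re.compile(r"(\W+)(?<!')", re.UNICODE)

def tokenize(string):
    return TOKEN_RE.split(string)

print(tokenize(u'perché.'))  # [u'perch\xe9', u'.', u''] under Python 2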
Upvotes: 2