Angelo

Reputation: 787

Regex Unicode in Python 2.x vs 3.x

I have a simple function for tokenizing words.

import re
def tokenize(string):
    return re.split("(\W+)(?<!')",string,re.UNICODE)

In Python 2.7 it behaves like this:

In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']

In Python 3.5.0 I get this:

In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']

The problem is that 'é' should not be treated as a character to split on. I thought that re.UNICODE would be enough to make \W work the way I intend.

How can I get the same behaviour as Python 3.x in Python 2.x?

Upvotes: 1

Views: 402

Answers (1)

Mark Tolonen

Reputation: 177610

You'll want to use Unicode strings, but also note that the third positional parameter of re.split is not flags but maxsplit:

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
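
To see why the original call misbehaves: re.UNICODE is just an integer-valued flag (32 in CPython), so passing it as the third positional argument silently sets maxsplit=32 instead of enabling Unicode matching. A quick demonstration of the pitfall, using the pattern from the question:

>>> import re
>>> int(re.UNICODE)  # the flag is a plain integer, easily mistaken for maxsplit
32
>>> # The original call is therefore equivalent to passing maxsplit=32;
>>> # the UNICODE flag never reaches the regex engine.
>>> re.split(r"(\W+)(?<!')", u'perché.', re.UNICODE) == \
...     re.split(r"(\W+)(?<!')", u'perché.', maxsplit=32)
True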

Example:

#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))

Output:

C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
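
As an aside, the flags keyword argument of re.split only exists since Python 2.7. If you need to support even older 2.x interpreters, one alternative, sketched below, is to embed the flag in the pattern itself with the inline (?u) modifier, which is equivalent to passing re.UNICODE:

#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    # (?u) at the start of the pattern enables re.UNICODE without
    # relying on the flags keyword argument.
    return re.split(r"(?u)(\W+)(?<!')", string)

print(tokenize(u'perché.'))

This prints the same result as above on both interpreters.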

Upvotes: 2
