Angelo

Reputation: 787

Regex Unicode in Python 2.x vs 3.x

I have a simple function for tokenizing words.

import re
def tokenize(string):
    return re.split("(\W+)(?<!')",string,re.UNICODE)

In Python 2.7 it behaves like this:

In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']

In Python 3.5.0 I get this:

In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']

The problem is that 'é' should not be treated as a character to split on. I thought that re.UNICODE would be enough to make \W work the way I intend.

How can I get the same behaviour as Python 3.x in Python 2.x?

Upvotes: 1

Views: 402

Answers (1)

Mark Tolonen

Reputation: 177610

You'll want to use Unicode strings, but also note that the third positional parameter of re.split is not flags but maxsplit:

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
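
To see why the original call misbehaves: re.UNICODE is just an integer-valued flag (32 in CPython), so passing it as the third positional argument silently sets maxsplit=32 instead of enabling Unicode matching. A quick demonstration of the pitfall, using the pattern from the question:

>>> import re
>>> int(re.UNICODE)  # the flag is a plain integer, easily mistaken for maxsplit
32
>>> # The original call is therefore equivalent to passing maxsplit=32;
>>> # the UNICODE flag never reaches the regex engine.
>>> re.split(r"(\W+)(?<!')", u'perché.', re.UNICODE) == \
...     re.split(r"(\W+)(?<!')", u'perché.', maxsplit=32)
True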

Example:

#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))

Output:

C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
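
As an aside, the flags keyword argument of re.split only exists since Python 2.7. If you need to support even older 2.x interpreters, one alternative, sketched below, is to embed the flag in the pattern itself with the inline (?u) modifier, which is equivalent to passing re.UNICODE:

#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
    # (?u) at the start of the pattern enables re.UNICODE without
    # relying on the flags keyword argument.
    return re.split(r"(?u)(\W+)(?<!')", string)

print(tokenize(u'perché.'))

This prints the same result as above on both interpreters.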

Upvotes: 2
