Reputation: 7493
I use a tokenizer to split French sentences into words and had problems with words containing the French character â.
I tried to isolate the problem and it eventually boiled down to this simple fact:
>>> re.match(r"’", u'â', re.U)
>>> re.match(r"[’]", u'â', re.U)
<_sre.SRE_Match object at 0x21d41d0>
So â is matched by a pattern containing ’, but only when the ’ is placed inside a character class.
Is there something wrong on my part regarding UTF-8 handling or is it a bug?
My Python version is:
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
[GCC 4.7.2] on linux2
EDIT:
Hm, embarrassingly enough, it seems that replacing the r prefix on the pattern with a u fixes the issue.
I wonder why the official documentation uses r so extensively, then :((
Upvotes: 7
Views: 252
Reputation: 62908
Your pattern should be a unicode string too:
>>> re.match(ur"’", u'â', re.U)
>>> re.match(ur"[’]", u'â', re.U)
Otherwise, apparently, sre encodes â to latin-1 and finds the resulting byte among the three bytes that make up a UTF-8 encoded ’.
"[’]" is equivalent to "[\xe2\x80\x99]", and u'â'.encode('latin-1') is \xe2.
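To make the byte overlap concrete, here is a minimal sketch that reproduces the effect without relying on how Python 2 treats byte-string patterns. It assumes that decoding the UTF-8 bytes as latin-1 is a fair stand-in for "each byte is treated as its own character"; the variable names are just for illustration.

```python
# -*- coding: utf-8 -*-
import re

apostrophe = u"\u2019"   # the character ’
a_circ = u"\u00e2"       # the character â

# UTF-8 encodes ’ as three bytes; the first one is 0xE2.
utf8_bytes = apostrophe.encode("utf-8")
print(repr(utf8_bytes))

# Assumption for this sketch: map each byte to the code point with the
# same value (which is what latin-1 decoding does).
byte_chars = utf8_bytes.decode("latin-1")

# As a plain sequence, the pattern needs all three characters in a row,
# so a lone â does not match.
sequence_match = re.match(byte_chars, a_circ)

# Inside a character class, any ONE of the three characters is enough,
# and U+00E2 (â) is one of them.
class_match = re.match(u"[" + byte_chars + u"]", a_circ)

# With a real unicode pattern the class holds only U+2019, so no match.
unicode_class_match = re.match(u"[\u2019]", a_circ)

print(sequence_match, class_match is not None, unicode_class_match)
```

This is why only the bracketed form matched in the question: the character class turned one three-byte character into three separate one-byte alternatives.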
Upvotes: 7