dr.umma

Reputation: 37

Match the € symbol in re with python

I'm trying to match the € symbol in a string, but I get strange behaviour when using the special character "?". It works as expected with normal characters:

import re

print re.match(r'a?1', 'a1')
<_sre.SRE_Match object at 0x3a2ba58>

print re.match(r'a?1', '1')
<_sre.SRE_Match object at 0x3a2ba58>

But with the € symbol I get this output:

print re.match(r'€?1', '€1')
<_sre.SRE_Match object at 0x3a2ba58>

print re.match(r'€?1', '1')
None

Any idea what's going on? I suspect it's something related to Unicode. I'm using Python 2.7. Thank you.

Upvotes: 2

Views: 1030

Answers (2)

holdenweb

Reputation: 37103

You will notice that the question has had the python2.7 tag added, as this question is version-specific.

In Python 2 a plain string literal is a byte string, not Unicode text. With a UTF-8 terminal (or a source file declared as UTF-8), the Euro symbol is stored as its three UTF-8 bytes, which appear as individual bytes in your string:

>>> r'€?1'
'\xe2\x82\xac?1'

So you specified a pattern that requires the bytes \xe2\x82, followed optionally by an \xac byte, and finally a mandatory digit 1. The ? applies only to the last byte, not to the whole Euro symbol, so '1' on its own cannot match.
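The byte-level reading of that pattern can be checked directly. This is a minimal sketch, assuming UTF-8 encoding; it builds the bytes with an explicit .encode() call so that it behaves the same on Python 2.7 and Python 3, and the variable names are only illustrative:

```python
from __future__ import print_function
import re

# The UTF-8 bytes a Python 2 str literal would hold for the pattern.
pattern = u'\u20ac?1'.encode('utf-8')            # b'\xe2\x82\xac?1'

# All three euro bytes present: matches.
full = re.match(pattern, u'\u20ac1'.encode('utf-8'))
# Last byte (\xac) left out: still matches, because "?" covers only that byte.
truncated = re.match(pattern, b'\xe2\x821')
# Leading bytes \xe2\x82 missing: no match.
bare = re.match(pattern, b'1')

print(full.span(), truncated.span(), bare)       # (0, 4) (0, 3) None
```

The second match is the telling one: the pattern accepts a truncated, invalid byte sequence, which shows the regex is working on bytes rather than characters.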

Explicitly identifying your Unicode literals as such will cure the problem.

>>> m = re.match(u'€?1', u'€1')
>>> m.start(), m.end()
(0, 2)
>>> m = re.match(u'€?1', u'1')
>>> m.start(), m.end()
(0, 1)
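If the data arrives as byte strings (from a file or socket, say), there are two ways to get the same result. The sketch below is an illustration rather than part of the original answer, and it assumes the bytes are UTF-8: either decode both pattern and subject before matching, or stay in bytes and wrap the Euro bytes in a non-capturing group so that ? applies to all three of them.

```python
from __future__ import print_function
import re

raw_pattern = u'\u20ac?1'.encode('utf-8')    # bytes, as Python 2 would store them
raw_text = b'1'

# Option 1: decode to Unicode text, then match character by character.
m1 = re.match(raw_pattern.decode('utf-8'), raw_text.decode('utf-8'))

# Option 2: stay in bytes, but group the three euro bytes so "?" covers them all.
euro = u'\u20ac'.encode('utf-8')             # b'\xe2\x82\xac'
grouped = b'(?:' + euro + b')?1'
m2 = re.match(grouped, raw_text)
m3 = re.match(grouped, euro + b'1')

print(m1.span(), m2.span(), m3.span())       # (0, 1) (0, 1) (0, 4)
```

Note that in a Python 2 source file (as opposed to the interactive prompt), a literal € also requires an encoding declaration such as `# -*- coding: utf-8 -*-` on the first or second line.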

Moving to Python 3, the interpreter assumes that all string literals are Unicode unless they are explicitly tagged as bytestrings using a b'...' literal, so the problem doesn't occur:

>>> r'€?1'
'€?1'
>>> m = re.match(r'€?1', '€1')
>>> m.start(), m.end()
(0, 2)
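A short sketch of the str/bytes distinction in Python 3 (Python 3 only; the names are illustrative): the same pattern has a different length depending on its type, and the re module refuses to mix the two types instead of silently comparing bytes.

```python
import re

text_pattern = '\u20ac?1'                     # str: the euro sign is one code point
byte_pattern = text_pattern.encode('utf-8')   # bytes: the euro sign is three bytes

print(len(text_pattern), len(byte_pattern))   # 3 5

# A str pattern treats the euro sign as a single optional character.
print(re.match(text_pattern, '1').span())     # (0, 1)

# Using a bytes pattern on a str subject raises TypeError.
try:
    re.match(byte_pattern, '1')
    mixing_error = None
except TypeError as exc:
    mixing_error = exc
print(type(mixing_error).__name__)            # TypeError
```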

Upvotes: 1

dwitvliet

Reputation: 7671

€ is not an ASCII character, so both the pattern and the subject need to be unicode strings:

print re.match(ur'€?1', u'€1', flags=re.UNICODE)
<_sre.SRE_Match object at 0x7ffde0084bf8>

print re.match(ur'€?1', u'1', flags=re.UNICODE)
<_sre.SRE_Match object at 0x7ffde0084bf8>
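As far as I can tell, the fix here comes from the u prefixes rather than from the flag. In Python 2, re.UNICODE mainly changes how classes such as \w, \d, \s and \b are interpreted, and this pattern contains only literal characters. A minimal check, written with \u20ac escapes so it does not depend on the source encoding:

```python
from __future__ import print_function
import re

pattern = u'\u20ac?1'    # unicode pattern: the euro sign is a single character

with_flag = re.match(pattern, u'1', flags=re.UNICODE)
without_flag = re.match(pattern, u'1')

# Both succeed and cover the same span.
print(with_flag.span(), without_flag.span())   # (0, 1) (0, 1)
```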

Upvotes: 2
