Reputation: 37
I'm trying to match the € symbol in a string, but I get strange behaviour when using the special character "?". It works as expected with normal characters:
import re
print re.match(r'a?1', 'a1')
<_sre.SRE_Match object at 0x3a2ba58>
print re.match(r'a?1', '1')
<_sre.SRE_Match object at 0x3a2ba58>
but with the € symbol I get this output
print re.match(r'€?1', '€1')
<_sre.SRE_Match object at 0x3a2ba58>
print re.match(r'€?1', '1')
None
Any idea about what's going on? I suspect it's something related to unicode. I'm using python 2.7. Thank you.
Upvotes: 2
Views: 1030
Reputation: 37103
You will notice that the question has had the python2.7 tag added, as this question is version-specific.
In Python 2 a plain string literal is a byte string, not Unicode text. When your source file or terminal is UTF-8 encoded, the Euro symbol is stored as a sequence of three UTF-8 bytes, which appear as individual bytes in your string:
>>> r'€?1'
'\xe2\x82\xac?1'
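So you specified a pattern that requires the bytes \xe2\x82, followed optionally by an \xac byte, and finally a mandatory digit 1. The ? quantifier applies only to the last byte of the sequence, not to the whole Euro symbol, which is why the string '1' on its own does not match.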
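The byte-level behaviour described above can be reproduced on Python 3 by building the pattern from explicit byte strings. This is only a sketch, assuming Python 3 and UTF-8; the variable names are illustrative and not part of the original question.

```python
import re

# Assumption: Python 3, where byte strings must be written explicitly.
euro = '€'.encode('utf-8')      # b'\xe2\x82\xac', three bytes
pattern = euro + b'?1'          # the same bytes Python 2 sees for r'€?1' in UTF-8

# The ? quantifier binds only to the last byte, \xac.
with_euro = re.match(pattern, euro + b'1')           # matches all four bytes
without_euro = re.match(pattern, b'1')               # None: \xe2\x82 is still required
two_bytes = re.match(pattern, b'\xe2\x82' + b'1')    # matches: only \xac was optional

# Grouping the three bytes makes the whole symbol optional.
grouped = re.match(b'(?:' + euro + b')?1', b'1')

print(with_euro is not None, without_euro, two_bytes is not None, grouped is not None)
```

Grouping with (?:...) is one possible workaround if you have to stay with byte strings; using Unicode strings is usually simpler.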
Explicitly identifying your Unicode literals as such will cure the problem.
>>> m = re.match(u'€?1', u'€1')
>>> m.start(), m.end()
(0, 2)
>>> m = re.match(u'€?1', u'1')
>>> m.start(), m.end()
(0, 1)
Moving to Python 3, the interpreter assumes that all string literals are Unicode unless they are explicitly tagged as bytestrings using a b'...' literal, so the problem doesn't occur:
>>> r'€?1'
'€?1'
>>> m = re.match(r'€?1', '€1')
>>> m.start(), m.end()
(0, 2)
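If the data arrives as bytes in Python 3 (for example from a file opened in binary mode), one way to get the same behaviour is to decode it before matching. The following is a minimal sketch under that assumption; the names raw and text are hypothetical.

```python
import re

# Assumption: the input is UTF-8 encoded bytes.
raw = b'\xe2\x82\xac1'
text = raw.decode('utf-8')      # '€1', two code points

m = re.match(r'€?1', text)
span = (m.start(), m.end())     # offsets count characters, not bytes

# Python 3 refuses to mix a str pattern with bytes input.
try:
    re.match(r'€?1', raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(span, mixed_ok)
```

Because Python 3 raises a TypeError instead of silently matching bytes, this class of mistake tends to surface early.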
Upvotes: 1
Reputation: 7671
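€ is not an ASCII character, so you need to match with Unicode strings (the u prefix is what fixes the match here; re.UNICODE only changes how classes such as \w and \b behave):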
print re.match(ur'€?1', u'€1', flags=re.UNICODE)
<_sre.SRE_Match object at 0x7ffde0084bf8>
print re.match(ur'€?1', u'1', flags=re.UNICODE)
<_sre.SRE_Match object at 0x7ffde0084bf8>
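To show what the flag does and does not affect, here is a small sketch assuming Python 3, where Unicode matching is the default for str patterns and re.ASCII switches it off. The sample word is an illustration only.

```python
import re

# A literal € matches the same way regardless of the flag.
literal_default = re.match('€?1', '€1')
literal_ascii = re.match('€?1', '€1', flags=re.ASCII)

# The flag matters for character classes such as \w.
word = 'caf\u00e9'              # 'café' with a precomposed é
unicode_word = re.match(r'\w+', word).group()
ascii_word = re.match(r'\w+', word, flags=re.ASCII).group()

print(literal_default.span(), literal_ascii.span(), unicode_word, ascii_word)
```

In Python 2 the defaults are reversed: \w is ASCII-only unless re.UNICODE is passed.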
Upvotes: 2