Reputation: 6764
When I use ASCII characters it works as expected:
>>> import re
>>> r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
>>> s = 'what is it?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'what<br>is<br>it<br>'
But when I change the string s to a similar one that contains Unicode characters, it doesn't work as I want:
>>> s = u'что это есть?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'\u0447\u0442\u043e \u044d\u0442\u043e \u0435\u0441\u0442\u044c?'
This looks strange (the string stays unchanged), because I use re.UNICODE in both cases. However, re.match successfully matches the groups with the same re.UNICODE flag:
>>> m = re.match(r, s, re.UNICODE)
>>> m.group(1)
u'\u0447\u0442\u043e'
>>> m.group(2)
u'\u044d\u0442\u043e'
>>> m.group(3)
u'\u0435\u0441\u0442\u044c'
Upvotes: 2
Views: 73
Reputation: 381
You have to pass re.UNICODE as the flags keyword argument:
re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags = re.UNICODE)
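Otherwise Python correctly assumes that the 4th positional parameter is count, as specified in the re documentation: the signature is re.sub(pattern, repl, string, count=0, flags=0). Since re.UNICODE is just an integer constant (32), your call is read as count=32 with no flags set, so \w only matches ASCII word characters and nothing is replaced. re.match has no count parameter, so its third positional argument really is flags, which is why it worked.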
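To see the difference in isolation, here is a minimal sketch. It is written for Python 3 rather than the Python 2 used in the question, so it makes one assumption: re.ASCII is used to roughly emulate Python 2's default ASCII-only \w (in Python 3, str patterns are Unicode-aware by default). The variable names (pattern, repl, unchanged, changed) are only illustrative.

```python
import re

pattern = r'(\w+)\s+(\w+)\s+(\w+)\?'
repl = r'\1<br>\2<br>\3<br>'
s = 'что это есть?'

# re.UNICODE is an integer flag; as a 4th positional argument it is taken as count.
flag_value = int(re.UNICODE)

# What the original call effectively did: count=32 and no Unicode flag.
# Assumption: re.ASCII stands in for Python 2's default ASCII-only \w.
unchanged = re.sub(pattern, repl, s, count=flag_value, flags=re.ASCII)

# Passing the flag by keyword makes \w match Cyrillic letters.
changed = re.sub(pattern, repl, s, flags=re.UNICODE)

print(flag_value)
print(unchanged == s)
print(changed)
```

The first call leaves s untouched because no ASCII word characters are found, while the second one inserts the <br> separators.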
Full Example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
#s = 'what is it?'
s = u'что это есть?'
print re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags = re.UNICODE).encode('utf-8')
Upvotes: 3