Function re.sub() refuses to work when I change an ANSI string to UNICODE one

Question

When I use ANSI characters it works as expected:

>>> import re
>>> r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
>>> s = 'what is it?'
>>> re.sub(r, ur'\1\n\2\n\3\n', s, re.UNICODE)
u'what\nis\nit\n'

But when I change the string s to a similar one that contains Unicode characters, it doesn't work as I want:

>>> s = u'что это есть?'
>>> re.sub(r, ur'\1\n\2\n\3\n', s, re.UNICODE)
u'\u0427\u0442\u043e \u044d\u0442\u043e \u0435\u0441\u0442\u044c?'

This looks strange (the string stays unchanged), because I use re.UNICODE in both cases. However, re.match with the same UNICODE flag successfully matches the groups:

>>> m = re.match(r, s, re.UNICODE)
>>> m.group(1)
u'\u0447\u0442\u043e'
>>> m.group(2)
u'\u044d\u0442\u043e'
>>> m.group(3)
u'\u0435\u0441\u0442\u044c'

dd23 · Accepted Answer

You have to pass re.UNICODE as the flags keyword argument:

re.sub(r, ur'\1\n\2\n\3\n', s, flags = re.UNICODE)

Otherwise Python correctly treats the 4th positional argument as count, as specified in the re documentation, and no flags are applied.
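
A small sketch of why the positional call misbehaves: flag constants are plain integers (re.UNICODE is 32), so in the count position they are read as a replacement limit. The sketch is written for Python 3, where str patterns are already Unicode-aware, so re.IGNORECASE (value 2) stands in as a flag whose effect is easy to see. The function name demo is only illustrative.

```python
import re

def demo():
    text = 'AaAa'
    # re.IGNORECASE has the integer value 2, so a call that puts it in the
    # count position behaves like count=2: matching stays case-sensitive
    # and only the two lowercase 'a' characters are replaced.
    as_count = re.sub('a', 'x', text, count=int(re.IGNORECASE))
    # Passed by keyword, the flag is actually applied to the pattern.
    as_flag = re.sub('a', 'x', text, flags=re.IGNORECASE)
    return as_count, as_flag
```

Here demo() returns ('AxAx', 'xxxx'): the first call silently ignores the flag, which mirrors what happens to re.UNICODE in the question.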

Full Example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
#s = 'what is it?'
s = u'что это есть?'
print re.sub(r, ur'\1\n\2\n\3\n', s, flags = re.UNICODE).encode('utf-8')
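
As a possible alternative to the full example above, the flag can be attached when the pattern is compiled, which avoids the count/flags mix-up because the compiled pattern's sub() takes no flags argument at all. This is a minimal sketch, assuming the same pattern and input as in the question; it avoids the ur'' prefix so that it should also run on Python 3, and the variable names pattern and result are only illustrative.

```python
# -*- coding: utf-8 -*-
import re

# The flag is fixed at compile time, so sub() only needs repl and string.
pattern = re.compile(r'(\w+)\s+(\w+)\s+(\w+)\?', re.UNICODE)

s = u'что это есть?'

# Each of the three words ends up on its own line; the '?' is consumed.
result = pattern.sub(r'\1\n\2\n\3\n', s)
```

On Python 3 re.UNICODE is already the default for str patterns, so the flag is redundant there but harmless.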
