Gaby Solis

Reputation: 2587

Python Unicode String Replacement: u, r or nothing

Hi, take a look at the following code snippet for Python 2.7:

# -*- coding: utf-8 -*-
content = u"<p>和製英語とかカタカナ英語、<br/>ジャパングリッシュなどと呼ばれる英語っぽいけど実は英語じゃない言葉があります。</p>"
#print content
print content.replace(u"<p>",u"<div>").replace(u"</p>",u"</div>").replace(u"<br/>",u"")
print content.replace("<p>","<div>").replace("</p>","</div>").replace("<br/>","")
print content.replace(r"<p>",r"<div>").replace(r"</p>",r"</div>").replace(r"<br/>",r"")

The result is the same:

<div>和製英語とかカタカナ英語、ジャパングリッシュなどと呼ばれる英語っぽいけど実は英語じゃない言葉があります。</div>

My question is: is there any difference between the three "replace" statements (u, r, or no prefix)? Which one is best?

Upvotes: 2

Views: 1649

Answers (2)

LSerni

Reputation: 57418

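In this case there is no difference in the result, because the search and replacement strings are pure ASCII. Strictly speaking, u"<div>" is a Unicode string, while "<div>" and r"<div>" are byte strings (the five bytes < d i v >), but Python 2 implicitly converts ASCII byte strings to Unicode, so they behave the same here. The r prefix only switches off backslash escapes, and these strings contain no backslashes, so r"<div>" and "<div>" are identical.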
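To make the role of the r prefix concrete, here is a minimal sketch (the variable names are only for illustration) showing that r matters only when a backslash is present. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; none of these come from the question.
plain = "<div>"
raw = r"<div>"
uni = u"<div>"

# No backslash in the literal, so the r prefix changes nothing.
same_without_backslash = (plain == raw)

# An ASCII-only unicode literal compares equal to the plain literal.
same_as_unicode = (uni == plain)

# With a backslash the two forms differ: "\n" is one newline character,
# while r"\n" is a backslash followed by the letter n.
escaped_len = len("\n")
raw_len = len(r"\n")

print("%s %s %d %d" % (same_without_backslash, same_as_unicode, escaped_len, raw_len))
```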

UTF-8 encodes ASCII characters (below 0x80) as the same single bytes. So 'd' is byte 0x64 in ASCII, and its UTF-8 encoding is again 0x64. As long as there are no characters outside the 00..7F hex range, there is no difference.

The difference appears as soon as one non-ASCII character is used. For example, in Italian 'Pero' is four characters encoded as four bytes, P-e-r-o, while 'Però' is four characters encoded in UTF-8 as five bytes: P-e-r-0xC3-0xB2.
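The character and byte counts can be checked directly. This is a small sketch with made-up variable names; it uses \u escapes so that it does not depend on the encoding of the source file.

```python
# -*- coding: utf-8 -*-
# Illustrative names only.
ascii_word = u"Pero"
accented_word = u"Per\u00f2"   # 'Però', with U+00F2 written as an escape

ascii_bytes = ascii_word.encode("utf-8")
accented_bytes = accented_word.encode("utf-8")

# Characters versus bytes for each word.
print("%d chars, %d bytes" % (len(ascii_word), len(ascii_bytes)))
print("%d chars, %d bytes" % (len(accented_word), len(accented_bytes)))
```

The ASCII word has the same length in characters and bytes, while the accented word gains one byte because U+00F2 needs two bytes in UTF-8.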

I would argue that the 'u' prefix should always be used for text: it makes no difference when it is not needed, and when it is needed it will save your data. (The design of UTF-8 was meant to promote exactly this type of backward compatibility with ASCII: see http://en.wikipedia.org/wiki/UTF-8 ).

Upvotes: 0

Mark Tolonen

Reputation: 177901

The first one is best. The other two have to implicitly convert their byte strings to Unicode before the replacement can be done on the Unicode content string. With the strings provided, the result happens to be the same. If the search or replacement strings contained non-ASCII characters, the other two would raise a UnicodeDecodeError, because the default codec for the implicit conversion is ascii on Python 2.X.
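The failure mode can be sketched as follows. Python 2 performs the ascii decode behind the scenes when a byte string meets a Unicode string; the example below does that decode explicitly so that it behaves the same on Python 2.7 and Python 3 (where mixing bytes and text is rejected outright). The variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Illustrative names only.
content = u"<p>Per\u00f2</p>"

# Roughly what a non-u literal such as "<div>ò" holds in a UTF-8 source file
# on Python 2: raw UTF-8 bytes rather than Unicode text.
replacement_bytes = u"<div>\u00f2".encode("utf-8")

# The explicit equivalent of Python 2's implicit conversion.
try:
    replacement_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# With Unicode literals throughout, no conversion is needed at all.
result = content.replace(u"<p>", u"<div>\u00f2")

print(decode_failed)
```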

Note the speed difference as well:

C:\>python -m timeit -s "content=u'<p>blah<br/>blah</p>'" "content.replace(u'<p>',u'<div>').replace(u'</p>',u'</div>').replace(u'<br/>',u'')"
1000000 loops, best of 3: 1.09 usec per loop

C:\>python -m timeit -s "content=u'<p>blah<br/>blah</p>'" "content.replace('<p>','<div>').replace('</p>','</div>').replace('<br/>','')"
1000000 loops, best of 3: 1.76 usec per loop

C:\>python -m timeit -s "content=u'<p>blah<br/>blah</p>'" "content.replace(r'<p>',r'<div>').replace(r'</p>',r'</div>').replace(r'<br/>',r'')"
1000000 loops, best of 3: 1.75 usec per loop

Upvotes: 3
