Reputation: 38701
Uhh, Python 2 / 3 is so frustrating... Consider this example, test.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
if sys.version_info[0] < 3:
    text_type = unicode
    binary_type = str
    def b(x):
        return x
    def u(x):
        return unicode(x)
else:
    text_type = str
    binary_type = bytes
    import codecs
    def b(x):
        return codecs.latin_1_encode(x)[0]
    def u(x):
        return x

tstr = " ▲ "
sys.stderr.write(tstr)
sys.stderr.write("\n")
sys.stderr.write(str(len(tstr)))
sys.stderr.write("\n")
Running it:
$ python2.7 test.py
▲
5
$ python3.2 test.py
▲
3
Great, I get two different string lengths. Maybe wrapping the string in one of these wrappers I found around the net will help?
For tstr = text_type(" ▲ "):
$ python2.7 test.py
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = text_type(" ▲ ")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
$ python3.2 test.py
▲
3
For tstr = u(" ▲ "):
$ python2.7 test.py
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = u(" ▲ ")
  File "test.py", line 11, in u
    return unicode(x)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
$ python3.2 test.py
▲
3
For tstr = b(" ▲ "):
$ python2.7 test.py
▲
5
$ python3.2 test.py
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = b(" ▲ ")
  File "test.py", line 17, in b
    return codecs.latin_1_encode(x)[0]
UnicodeEncodeError: 'latin-1' codec can't encode character '\u25b2' in position 1: ordinal not in range(256)
For tstr = binary_type(" ▲ "):
$ python2.7 test.py
▲
5
$ python3.2 test.py
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = binary_type(" ▲ ")
TypeError: string argument without an encoding
Well, that certainly makes things easy.
So, how do I get the same string length (in this case, 3) in both Python 2.7 and 3.2?
Upvotes: 4
Views: 991
Reputation: 38701
Well, it turns out unicode() in Python 2.7 has an encoding argument, and that apparently helps:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
if sys.version_info[0] < 3:
    text_type = unicode
    binary_type = str
    def b(x):
        return x
    def u(x):
        return unicode(x, "utf-8")
else:
    text_type = str
    binary_type = bytes
    import codecs
    def b(x):
        return codecs.latin_1_encode(x)[0]
    def u(x):
        return x

tstr = u(" ▲ ")
sys.stderr.write(tstr)
sys.stderr.write("\n")
sys.stderr.write(str(len(tstr)))
sys.stderr.write("\n")
Running this, I get what I needed:
$ python2.7 test.py
▲
3
$ python3.2 test.py
▲
3
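As far as I can tell, the reason this works is that a plain "..." literal in Python 2 is a byte string holding the file's UTF-8 bytes, so len() counts 5 bytes, while in Python 3 it is text, so len() counts 3 characters. Without an encoding, unicode() falls back to the ASCII codec, which explains the UnicodeDecodeError on byte 0xe2. Here is a minimal sketch of that difference, which should run on both 2.7 and 3.x; the variable names are just for illustration:

```python
# -*- coding: utf-8 -*-
import sys

# The UTF-8 encoding of " ▲ ": a space, U+25B2 (three bytes), a space.
raw = b" \xe2\x96\xb2 "       # bytes literal, accepted by Python 2.6+ and 3.x
text = raw.decode("utf-8")    # unicode in Python 2, str in Python 3

byte_count = len(raw)         # number of encoded bytes
char_count = len(text)        # number of code points

sys.stdout.write("%d bytes, %d characters\n" % (byte_count, char_count))
```

An alternative, not tested in this post, is putting from __future__ import unicode_literals at the top of the file (available since Python 2.6), which makes plain string literals text in Python 2 as well.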
Upvotes: 6