Reputation: 13526
This is not the first time, but this confused me:
Open the file with codecs.open:
cfh = codecs.open('/tmp/ddfh', 'wb', 'utf-8')
Try to write the string, sa:
In [109]: sa
Out[109]: '\xe6\x96\xb0 \xe9\x97\xbb\xe3\x80\x80\xe7\xbd\x91 \xe9\xa1\xb5\xe3\x80\x80\xe8\xb4\xb4 \xe5\x90\xa7\xe3\x80\x80\xe7\x9f\xa5 \xe9\x81\x93\xe3\x80\x80\xe9\x9f\xb3 \xe4\xb9\x90\xe3\x80\x80\xe5\x9b\xbe \xe7\x89\x87\xe3\x80\x80\xe8\xa7\x86 \xe9\xa2\x91\xe3\x80\x80\xe5\x9c\xb0 \xe5\x9b\xbe'
In [110]: print sa
新 闻 网 页 贴 吧 知 道 音 乐 图 片 视 频 地 图
In [111]: sa.encode()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/home/za/tmp/<ipython-input-111-dea686030e89> in <module>()
----> 1 sa.encode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
In [112]: sa.decode()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/home/za/tmp/<ipython-input-112-a79b22010b0e> in <module>()
----> 1 sa.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
In [113]: sa.encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/home/za/tmp/<ipython-input-113-ed97f8f61eb5> in <module>()
----> 1 sa.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
In [114]: sa.decode('utf-8')
Out[114]: u'\u65b0 \u95fb\u3000\u7f51 \u9875\u3000\u8d34 \u5427\u3000\u77e5 \u9053\u3000\u97f3 \u4e50\u3000\u56fe \u7247\u3000\u89c6 \u9891\u3000\u5730 \u56fe'
In [115]: cfh.write(sa.decode('utf-8'))
This works on the machine above but fails on another machine, also running Ubuntu with the same $LANG setting. I keep hitting "'ascii' codec can't ...."
Can anyone point me to a good document? The official documentation for the codecs module does not help me.
===
The problem comes from this code:
# encoding=utf-8
# ......
def write_video_info_file(folder, filename, infos):
    # infos : a list of list, lines of text grouped by topic, results of language translations.
    absfn = os.path.join(folder, filename)
    with codecs.open(absfn, mode='wb', encoding='utf-8') as fh:
        for vinfo in infos:
            for v in vinfo:
                fh.write(v)
            fh.write("\n\n" + vi_delimit + "\n\n")
This tested OK on my local machine, but after deploying it to a remote machine it raises a lot of errors like: UnicodeDecodeError: 'ascii' codec can't ...
After that I tried nearly every mode= value, as well as open without codecs.
$ echo $LANG # en_US.UTF-8
Python 2.7.3
Ubuntu 12.04
LANG=en_US.UTF-8
LANGUAGE=
LC_ALL=
===
I found the solution: make sure every byte string is decoded from UTF-8 to unicode before writing it:
if isinstance(mystring, str):
    mystring = mystring.decode('utf-8')
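The check above can be wrapped in a small helper so every value is normalised before it reaches the file. This is only a sketch: the names to_unicode and write_lines and the sample data are hypothetical, and it assumes all byte strings really are UTF-8. It tests against bytes rather than str so the same code runs on Python 2.7 (where bytes is an alias of str) and on Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def to_unicode(value, encoding='utf-8'):
    """Decode byte strings to text; pass unicode text through unchanged."""
    # `bytes` is the same type as `str` on Python 2, so this works on 2 and 3.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def write_lines(path, lines):
    """Write a mix of byte strings and unicode strings as UTF-8 (hypothetical helper)."""
    with codecs.open(path, mode='w', encoding='utf-8') as fh:
        for line in lines:
            # Only unicode text is handed to the codecs writer, so no
            # implicit ASCII decode can happen.
            fh.write(to_unicode(line))
            fh.write(u'\n')


# Hypothetical sample data: one UTF-8 byte string and one unicode string.
sample = [b'\xe6\x96\xb0 \xe9\x97\xbb', u'\u7f51 \u9875']
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
write_lines(path, sample)

# Read the file back to confirm both kinds of input were written as text.
with codecs.open(path, mode='r', encoding='utf-8') as fh:
    content = fh.read()
```

Decoding at the point where data enters the program, rather than just before each write, usually keeps the rest of the code free of mixed types.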
Upvotes: 0
Views: 4034
Reputation: 1121376
Your data is already encoded to UTF-8. Just open the file without codecs.open() and write the data directly:
with open('/tmp/ddfh', 'wb') as output:
    output.write(sa)
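As a quick check of this approach, the sketch below writes a shortened, hypothetical version of sa in binary mode to a temporary file (not /tmp/ddfh) and reads it back. The b'' prefix is only there so the snippet also runs on Python 3; on Python 2.7 it is the same str type as in the question.

```python
import os
import tempfile

# Shortened stand-in for `sa`: bytes that are already UTF-8 encoded.
sa = b'\xe6\x96\xb0 \xe9\x97\xbb'

path = os.path.join(tempfile.mkdtemp(), 'ddfh')

# Binary mode writes the bytes untouched; no codec is involved.
with open(path, 'wb') as output:
    output.write(sa)

# Read the raw bytes back and decode them to confirm the file is valid UTF-8.
with open(path, 'rb') as check:
    raw = check.read()

decoded = raw.decode('utf-8')
```

This only works when every value written is a byte string; a unicode value would still need an explicit .encode('utf-8') first.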
Unicode encoding / decoding errors usually occur because you are mixing byte strings and unicode strings: concatenation, comparisons, using str.join() when you needed to use unicode.join() instead, etc.
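To make the mixing problem concrete: on Python 2, calling .encode() on a byte string, or combining it with a unicode string, first decodes it implicitly with the ASCII codec. The sketch below performs that ASCII decode explicitly so it fails the same way on Python 2.7 and Python 3, then shows one way to avoid it by decoding everything before joining. The sample values are hypothetical.

```python
# A UTF-8 encoded byte string, as in the question (shortened).
sa = b'\xe6\x96\xb0'

# This is the step Python 2 runs implicitly for sa.encode('utf-8') or
# for sa + u'...'; done explicitly, it raises the same error.
try:
    sa.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Avoiding the problem: decode every byte string first, then join
# using a unicode separator so all operands are the same type.
parts = [b'\xe6\x96\xb0', u'\u95fb']
text_parts = [p.decode('utf-8') if isinstance(p, bytes) else p for p in parts]
joined = u' '.join(text_parts)
```

The same rule applies to concatenation and comparisons: convert to one type at the boundary and keep it consistent afterwards.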
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Upvotes: 3