Reputation: 9039
I want to do this:
Take the bytes of this utf-8 string:
访视频
Encode those bytes in latin-1 and print the result:
è®¿è§†é¢‘
How do I do this in Python?
# -*- coding: utf-8 -*-
s = u'访视频'.encode('latin-1')
Causes this exception:
s = u'访视频'.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
Upvotes: 3
Views: 12408
Reputation: 107297
You need to first encode the string to UTF-8, then decode the resulting bytes as Latin-1. UTF-8 can encode any Unicode string, and it is fully compatible with the 7-bit ASCII set (any ASCII bytestring is also a valid UTF-8-encoded string):
>>> u'访视频'.encode('UTF-8').decode('latin-1')
u'\xe8\xae\xbf\xe8\xa7\x86\xe9\xa2\x91'
Note: The UTF-8 encoding can handle any Unicode character. It is also backwards compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters.
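The note above can be checked directly. This is a minimal sketch, assuming Python 3 (it should also run on Python 2 with the coding line shown); the variable names are only illustrative:

```python
# -*- coding: utf-8 -*-
# ASCII-only text produces identical bytes under ASCII and UTF-8.
ascii_text = u'hello'
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# Non-ASCII text still encodes to UTF-8 without error.
chinese = u'访视频'
utf8_bytes = chinese.encode('utf-8')  # 3 bytes per character here, 9 in total
round_trip = utf8_bytes.decode('utf-8') == chinese
```

Both `same_bytes` and `round_trip` come out `True`, and `utf8_bytes` holds the nine bytes that are then reinterpreted as Latin-1 in the example above.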
Upvotes: 2
Reputation: 365787
What you're asking to do is literally impossible. You can't encode those characters to Latin-1, because those characters don't exist in Latin-1.
To get the output you want, you want to decode the UTF-8 bytes as if they were Latin-1. Like this:
s = u'访视频'.encode('utf-8').decode('latin-1')
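To make the difference between encoding and decoding concrete, here is a small sketch, assuming Python 3 (the variable names are hypothetical). It shows why the direct encode fails and that the decode route is lossless:

```python
# -*- coding: utf-8 -*-
text = u'访视频'

# Latin-1 only covers code points 0-255; these are far above that,
# which is why text.encode('latin-1') raises UnicodeEncodeError.
code_points = [ord(ch) for ch in text]

# Encode to UTF-8 bytes, then reinterpret each byte as one Latin-1 character.
raw = text.encode('utf-8')
mojibake = raw.decode('latin-1')

# Latin-1 maps byte n to code point n, so the step can be reversed exactly.
restored = mojibake.encode('latin-1').decode('utf-8')
```

Here `mojibake` has nine characters (one per UTF-8 byte) and `restored` equals the original `text`.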
However, your desired output doesn't look like actual Latin-1, because in Latin-1, characters \x86 and \x91 are non-printable, so you're going to get this:
è®¿è§ é¢
(Notice the space in the middle in place of †, and the missing ‘ at the end; those are actually invisible control characters, not spaces.)
It looks like you want a Latin-1 superset, probably Windows codepage 1252. In which case what you really want is:
s = u'访视频'.encode('utf-8').decode('cp1252')
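The two decodings can be compared side by side. This is a sketch assuming Python 3, with illustrative variable names; it lists only the positions where Latin-1 and cp1252 disagree:

```python
# -*- coding: utf-8 -*-
raw = u'访视频'.encode('utf-8')

as_latin1 = raw.decode('latin-1')
as_cp1252 = raw.decode('cp1252')

# Collect the code points that differ between the two results.
differing = [(hex(ord(a)), hex(ord(b)))
             for a, b in zip(as_latin1, as_cp1252) if a != b]
# differing == [('0x86', '0x2020'), ('0x91', '0x2018')]
# i.e. control characters in Latin-1 versus the dagger and left quote in cp1252

# The cp1252 route is reversible for this particular input.
recovered = as_cp1252.encode('cp1252').decode('utf-8')
```

One caveat: Python's cp1252 codec leaves a few bytes undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d), so `.decode('cp1252')` can raise `UnicodeDecodeError` for other strings whose UTF-8 bytes include them, whereas `.decode('latin-1')` never fails.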
Upvotes: 7