Reputation: 1
I maintain an API that receives text input in multiple languages. We would like to standardize all strings on UTF-8.
Most of the solutions previous developers have tried involved sprinkling encode and decode calls around willy-nilly, which just leads to confusing, unmaintainable code.
For simplicity I am just defining x
here, but let's imagine it could be sent to my API. This string is encoded in latin-1:
x = '\xe9toile' # x is a byte string in python 2
x.encode('utf-8')
results in
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
The only way I know of to encode it to utf-8
is to first decode it as latin-1
and then do the encoding:
>>> x.decode('latin-1')
u'\xe9toile'
>>> x.decode('latin-1').encode('utf-8')
'\xc3\xa9toile'
What if I did not know that the byte string was encoded in latin-1?
How would I be able to encode it to utf-8
?
What would I do if x
was in some Chinese encoding that I don't know?
x = '\u54c8\u54c8'
x
is always a byte string.
Any help would be appreciated.
Upvotes: 0
Views: 874
Reputation: 298364
If x
is a byte string then it doesn't make sense for you to encode it. Text encodings are a way to represent text as bytes. You first have to turn your bytes into text by decoding them and then encode that text into your target encoding.
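In Python 3, where the bytes/str split is explicit, that two-step looks like this sketch (assuming, as in the question, that the source encoding is known to be latin-1):

```python
# bytes -> str -> bytes: decode with the source encoding,
# then encode with the target encoding.
raw = b'\xe9toile'            # latin-1 encoded input

text = raw.decode('latin-1')  # now real text: 'étoile'
utf8 = text.encode('utf-8')   # bytes in the target encoding

print(utf8)  # b'\xc3\xa9toile'
```

Note that in Python 3 a bytes object has no encode method at all, which makes the "it doesn't make sense to encode bytes" point a hard error rather than an implicit ASCII decode.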
What if I did not know that the byte string was encoded in
latin-1?
How would I be able to encode it to utf-8
?
You can try to guess the encoding but you can't always be right:
>>> 'Vlh'.encode('cp037')
'\xe5\x93\x88'
>>> '哈'.encode('utf-8')
'\xe5\x93\x88'
This example is a little contrived, but there's no way to know if the bytes '\xe5\x93\x88'
represent 哈
or Vlh
unless you know the original encoding.
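You can see the same ambiguity from the other direction: this Python 3 sketch decodes the identical three bytes under both codecs and gets two different, equally valid strings.

```python
# The same three bytes are valid in both encodings,
# with completely different meanings.
data = b'\xe5\x93\x88'

print(data.decode('cp037'))   # 'Vlh' (EBCDIC letters)
print(data.decode('utf-8'))   # '哈'
```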
The most sensible solution would be to just have your clients encode their text as UTF-8 and then you decode the bytes you receive as UTF-8.
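A minimal sketch of that convention at the API boundary (the helper name and the strict-decode-then-reject behavior are my own choices for illustration, not anything from the question):

```python
def read_text(raw: bytes) -> str:
    """Decode incoming bytes as UTF-8, the one encoding the API accepts.

    Strict decoding surfaces bad input at the boundary instead of letting
    mis-encoded bytes travel deeper into the system.
    """
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        raise ValueError('request body must be UTF-8 encoded') from exc

print(read_text('étoile'.encode('utf-8')))  # étoile
```

With this in place, a client that sends latin-1 bytes like b'\xe9toile' gets an immediate, explicit error rather than silently corrupted text.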
Upvotes: 1