Python 2 byte strings that are not encoded in UTF-8

Question

I maintain an API that can receive text input in multiple languages. We would like every string to be encoded in UTF-8.

Most of the solutions that previous developers tried involved calling encode and decode willy-nilly, which just leads to confusing, unmaintainable code.

For simplicity I am just defining x here, but let's imagine it can be sent to my API. This string is encoded in Latin-1:

x = '\xe9toile' # x is a byte string in python 2
x.encode('utf-8')

results in

*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
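
As far as I understand it, the error mentions 'ascii' because calling encode on a Python 2 byte string first decodes it implicitly with the default ASCII codec, and that hidden decode is the step that fails. A minimal sketch that makes the hidden step explicit (the b'' prefix is only there so the same lines also run on Python 3):

```python
x = b'\xe9toile'  # same bytes as '\xe9toile' in Python 2

try:
    # This is the decode that x.encode('utf-8') performs implicitly in Python 2.
    x.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```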

The only way that I know of to encode it to UTF-8 is to first decode it as Latin-1 and then encode the result:

x.decode('latin-1')
>>u'\xe9toile'
(x.decode('latin-1')).encode('utf-8')
>>'\xc3\xa9toile'
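
The two steps above could be wrapped in one place instead of being scattered around the code base. This is only a sketch; to_utf8 is a hypothetical helper name, not something that exists in our API, and it assumes the caller already knows the source encoding.

```python
def to_utf8(raw, source_encoding):
    """Transcode a byte string from source_encoding to UTF-8 (hypothetical helper)."""
    text = raw.decode(source_encoding)  # bytes -> unicode text
    return text.encode('utf-8')         # unicode text -> UTF-8 bytes


x = b'\xe9toile'  # Latin-1 bytes, same as '\xe9toile' in Python 2
utf8_bytes = to_utf8(x, 'latin-1')
print(repr(utf8_bytes))
```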

What if I did not know that the byte string was encoded in Latin-1? How would I be able to encode it to UTF-8?

What would I do if x were in some Chinese encoding that I don't know?

x = '\u54c8\u54c8'

x is always a byte string. Any help would be appreciated.
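
To make the unknown-encoding case concrete, here is a best-effort sketch of the kind of thing I have in mind. It is a guess, not real detection: the bytes themselves carry no record of their encoding. The function names and the candidate list are my own assumptions. UTF-8 is tried first because it rejects most non-UTF-8 input, and Latin-1 has to come last because it maps every possible byte and therefore never fails.

```python
def decode_best_effort(raw, candidates=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate list is an assumption, and a
    successful decode does not prove the guess was right.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


def normalize_to_utf8(raw):
    """Re-encode raw as UTF-8 using the best-effort guess above."""
    text, _ = decode_best_effort(raw)
    return text.encode('utf-8')


print(decode_best_effort(b'\xe9toile'))         # not valid UTF-8, falls back
print(decode_best_effort(b'\xc3\xa9toile'))     # already valid UTF-8
print(repr(normalize_to_utf8(b'\xe9toile')))
```

One caveat about my own example: in a Python 2 byte string '\u54c8\u54c8' is not an escape sequence, so that x is really twelve ASCII characters and not Chinese text in any encoding.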
