DATGUY

Reputation: 1

Python 2 byte strings that are not encoded in UTF-8

I maintain an API that receives text input in multiple languages. We would like all strings to be encoded in UTF-8.

Most of the solutions that previous developers have tried involved calling encode and decode willy-nilly, which just leads to confusing, unmaintainable code.

For simplicity I am just defining x here, but let's imagine it was sent to my API. This string is encoded in latin-1:

x = '\xe9toile' # x is a byte string in python 2
x.encode('utf-8')

results in

*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

The only way that I know of to encode it as UTF-8 is to first decode it as latin-1 and then encode the result.

>>> x.decode('latin-1')
u'\xe9toile'
>>> x.decode('latin-1').encode('utf-8')
'\xc3\xa9toile'

What if I did not know that the byte string was encoded in latin-1? How would I be able to encode it as UTF-8?

What would I do if x were in some Chinese encoding that I don't know?

x = '\u54c8\u54c8'

x is always a byte string. Any help would be appreciated.

Upvotes: 0

Views: 874

Answers (1)

Blender

Reputation: 298364

If x is a byte string then it doesn't make sense for you to encode it. Text encodings are a way to represent text as bytes. You first have to turn your bytes into text by decoding them and then encode that text into your target encoding.
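As a minimal sketch of that two-step process, assuming the source encoding is known in advance (the helper name `transcode` is made up for illustration and is not part of any library):

```python
def transcode(raw, source_encoding, target_encoding='utf-8'):
    """Decode bytes to text, then encode that text as target_encoding."""
    text = raw.decode(source_encoding)    # bytes -> text
    return text.encode(target_encoding)   # text -> bytes


latin1_bytes = b'\xe9toile'               # the question's x, encoded as latin-1
utf8_bytes = transcode(latin1_bytes, 'latin-1')
print(repr(utf8_bytes))
```

Calling encode directly on a Python 2 byte string makes Python decode it as ASCII first, which is where the UnicodeDecodeError in the question comes from.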

What if I did not know that the byte string was encoded in latin-1? How would I be able to encode it as UTF-8?

You can try to guess the encoding but you can't always be right:

>>> 'Vlh'.encode('cp037')
'\xe5\x93\x88'
>>> u'哈'.encode('utf-8')
'\xe5\x93\x88'

This example is a little contrived, but there's no way to know whether the bytes '\xe5\x93\x88' represent 哈 or Vlh unless you know the original encoding.
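The ambiguity can also be seen from the decoding side. The following is only an illustrative sketch: the same three bytes decode without error under several codecs, and nothing in the bytes themselves says which result was intended.

```python
raw = b'\xe5\x93\x88'

as_utf8 = raw.decode('utf-8')       # a single CJK character
as_cp037 = raw.decode('cp037')      # three Latin letters
as_latin1 = raw.decode('latin-1')   # latin-1 maps every byte to a character

# All three calls succeed, so a successful decode proves nothing by itself.
print(repr(as_utf8), repr(as_cp037), repr(as_latin1))
```

Because latin-1 accepts every possible byte sequence, "try latin-1 and see if it fails" can never detect a wrong guess.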

The most sensible solution would be to just have your clients encode their text as UTF-8 and then you decode the bytes you receive as UTF-8.
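A rough sketch of what that could look like at the API boundary, assuming the request body arrives as a byte string (the function name `decode_request_text` is hypothetical):

```python
def decode_request_text(raw):
    """Return the text if raw is valid UTF-8, otherwise raise ValueError."""
    try:
        return raw.decode('utf-8')   # strict error handling is the default
    except UnicodeDecodeError as exc:
        raise ValueError('request body is not valid UTF-8: %s' % exc)


text = decode_request_text(b'\xc3\xa9toile')   # valid UTF-8, decodes to text

try:
    decode_request_text(b'\xe9toile')          # latin-1 bytes, not valid UTF-8
    rejected = False
except ValueError:
    rejected = True
```

Rejecting invalid input up front keeps the rest of the code working with text only, so encode and decode calls stay at the edges of the application.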

Upvotes: 1
