Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

Question

Well, let me introduce the problem first.

I received some data via POST/GET requests. The data were UTF-8 encoded bytes, but I didn't realize that and converted them with a plain str() call. Now I have a database full of "nonsense data" and can't find a way back.

Example code:

unicode_str - this is the string I should obtain

encoded_str - this is the string I got with POST/GET requests - initial data

bad_str - the data I currently have in the database, from which I need to recover the unicode string.

So apparently I know how to convert: unicode_str =(encode)=> encoded_str =(str)=> bad_str

But I couldn't come up with a way to go back: bad_str =(???)=> encoded_str =(decode)=> unicode_str

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str) ???

Reti43 · Accepted Answer

You converted a bytes object to a string with str(), which gives you only the textual representation (the repr) of the bytes object, escape sequences and b'...' wrapper included. You can recover the original bytes object with ast.literal_eval() (credit to Mark Tolonen for the suggestion); a simple decode() then does the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you were the one who generated the strings, using eval() would be safe, but ast.literal_eval() only evaluates Python literals and cannot run arbitrary code, so why not be safer?
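If the whole database needs repairing, it may help to wrap the two steps in a small function that leaves already-correct values alone. The following is only a sketch under assumptions: the name recover_text and the choice to return unparseable input unchanged are mine, not part of the question, and a value that parses as bytes but is not valid UTF-8 will still raise UnicodeDecodeError.

```python
import ast


def recover_text(bad_str, encoding="utf-8"):
    """Undo str(bytes): parse the bytes literal, then decode it.

    Hypothetical helper. Returns the input unchanged when it is not the
    repr of a bytes object (for example, a row that was stored correctly).
    """
    try:
        value = ast.literal_eval(bad_str)
    except (ValueError, SyntaxError):
        return bad_str  # not a Python literal at all
    if not isinstance(value, bytes):
        return bad_str  # a literal, but not bytes (e.g. "123")
    return value.decode(encoding)


if __name__ == "__main__":
    original = "Příliš žluťoučký kůň úpěl ďábelské ódy"
    stored = repr(original.encode("utf-8"))  # same text that str(bytes) produces
    print(recover_text(stored) == original)
```

Applied to each stored value, this should turn the b'...' strings back into text while passing through anything that was never mangled in the first place.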
