thefragileomen

Reputation: 1547

decoding unicode string variables in Python

I am using an API in Python 2.7 to obtain a string whose content is not known in advance. The content can be in English, German or French. The returned string is assigned to the variable category. An example of a returned value for category is:

"temp\\u00eate de poussi\\u00e8res"

I have tried category.decode('utf-8') to turn the escape sequences into the actual characters (French, in the above case), but it returns the same value, just with a unicode 'u' prefix in front:

u'"temp\\u00eate de poussi\\u00e8res"'

I also tried category.encode('utf-8'), but it returns the same value (minus the 'u' that precedes the string):

'"temp\\u00eate de poussi\\u00e8res"'

Any suggestions?

Upvotes: 1

Views: 1632

Answers (2)

Mark Tolonen

Reputation: 177901

It looks like the API uses JSON. You can decode it with the json module:

>>> import json
>>> json.loads('"temp\\u00eate de poussi\\u00e8res"')
u'temp\xeate de poussi\xe8res'
>>> print(json.loads('"temp\\u00eate de poussi\\u00e8res"'))
tempête de poussières
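If it is not certain that every value from the API is a JSON string literal, the call can be wrapped defensively. The following is only a sketch: the helper name decode_category is made up for illustration, and falling back to the raw value on failure is an assumption about what you would want, not something the API dictates.

```python
import json

# Text type on both Python 2 (unicode) and Python 3 (str)
text_type = type(u"")

def decode_category(raw):
    """Hypothetical helper: decode raw if it is a JSON string literal,
    otherwise hand back raw unchanged."""
    try:
        value = json.loads(raw)
    except ValueError:
        # Malformed JSON raises ValueError (or a subclass of it)
        return raw
    # json.loads can also yield numbers, lists, etc.; only accept text
    return value if isinstance(value, text_type) else raw
```

Calling decode_category('"temp\\u00eate de poussi\\u00e8res"') gives the same unicode string as the json.loads call above, with the surrounding quotes already removed.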

Upvotes: 1

rodrigo

Reputation: 98436

I think you have literal backslashes in your string, not unicode characters.

That is, \u00ea is the unicode escape sequence for ê, but \\u00ea is actually a backslash (escaped), the letter u, two zeros and two more letters.

The same goes for the quotation marks: your first and last characters are literal double quotes ".
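A quick way to check this is to compare lengths. This is just an illustrative sketch; the variable names escaped and real are made up here.

```python
escaped = '\\u00ea'  # as returned by the API: backslash, 'u', '0', '0', 'e', 'a'
real = u'\u00ea'     # a real unicode string holding the single character ê

print(len(escaped))        # 6
print(len(real))           # 1
print(escaped[0] == '\\')  # True: the first character is a literal backslash
```

Since the string only contains ASCII characters, decode('utf-8') and encode('utf-8') have nothing to convert, which is why they appear to do nothing.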

You can convert each of those backslash-plus-codepoint sequences into its equivalent character with:

x = '"temp\\u00eate de poussi\\u00e8res"'
d = x.decode("unicode_escape")
print d

The output is:

"tempête de poussières"

Note that to see the proper international characters you have to use print. If instead you just write d in the interactive Python shell you get:

 u'"temp\xeate de poussi\xe8res"'

where \xea is equivalent to \u00ea, that is, the escape sequence for ê.

Removing the quotes, if required, is left as an exercise to the reader ;-).
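For anyone who would rather skip the exercise, here is one possible sketch. The function name unescape_and_unquote is made up for illustration, and it assumes at most one pair of surrounding double quotes. It uses codecs.decode instead of the str.decode method so that the same code also runs on Python 3.

```python
import codecs

def unescape_and_unquote(raw):
    """Hypothetical helper: expand backslash escapes, then drop one
    pair of surrounding double quotes if present."""
    decoded = codecs.decode(raw, "unicode_escape")
    if len(decoded) >= 2 and decoded[0] == '"' and decoded[-1] == '"':
        decoded = decoded[1:-1]
    return decoded

result = unescape_and_unquote('"temp\\u00eate de poussi\\u00e8res"')
```

Printing result shows tempête de poussières. Be aware that unicode_escape also expands other backslash sequences such as \n, and treats any non-ASCII bytes already in the input as Latin-1, so it is only a good fit when the input is plain ASCII with escapes, as it is here.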

Upvotes: 2
