adohertyd
adohertyd

Reputation: 2689

Processing Utf-8 data in python

I'm using the python module requests to get data from some API's and they all return json data which are converted to dicts. What I want to do is take some info from these dicts and either convert them all to python strings where I can use the stemming and string.translate() modules on them, or convert the whole thing to data that is recognisable to these modules. I can't do this with the UTF-8 data and it's doing my head in. Is there any solution to this at all? Can I iterate through the dict and convert it to ASCII?

The strange thing is I am comparing ASCII strings to the UTF data in other functions (if ASCII-word is in UTF dict: do something) and it works perfectly. The ASCII value matches the UTF-8 data all the time. I can't get my head around this encoding stuff at all

Upvotes: 2

Views: 432

Answers (2)

BrenBarn
BrenBarn

Reputation: 251428

UTF-8 is an extension of ASCII in that valid 7-bit ASCII text is also valid UTF-8 text, so if all the data is in fact representable in ASCII it doesn't make any difference whether it's ASCII or UTF-8.

If the data coming is UTF-8 encoded, the best approach is to decode it to unicode objects. For example if you read in a string from some source and store it in the variable utf8str, you can do utf8str.decode('utf-8'). Then pass this unicode object around and do all your operations on the unicode object. Instead of string.translate you can use unicode.translate (assuming you're referring to the string method called "translate" there).

If your modules cannot deal with unicode strings, you need to think about how you want to handle that. You have to decide what to do if your input contains characters that can't be represented in ASCII.

Upvotes: 3

mhermans
mhermans

Reputation: 2167

When you are sure the function does not support Unicode, you can always convert to an ASCII-approximation:

ascii_string = unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore') 

Upvotes: 0

Related Questions