Caio Alfonso

Reputation: 130

Reading .json file and converting unicode data to utf-8

I never really understood how encoding and decoding work in Python, and I run into this type of problem frequently. I have to read a JSON file and compare some of its values with other data.

In one of the files I have the string BAIXA DA INSCRI\u00c7\u00c3O ESTADUAL which should become BAIXA DA INSCRICAO ESTADUAL. I am reading the file like this:

with codecs.open(filepath, 'r') as file:
    filedata = json.loads(file.read())

However, the string is read as unicode and represented like this: u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'

How can I get the unaccented string, and what is the proper way to work with codecs in Python?

Upvotes: 0

Views: 259

Answers (1)

Serge Ballesta

Reputation: 148910

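It looks like you want to remove the diacritics from your text. You can convert it to Unicode normal form D (for decomposed), which splits each accented letter into a base letter followed by combining marks, and then filter out every code point above 127: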

import unicodedata

txt = u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'
txt = u''.join(i for i in unicodedata.normalize('NFD', txt) if ord(i) < 128).encode('ASCII')

It should give the (byte) string:

'BAIXA DA INSCRICAO ESTADUAL'
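
As a slightly different sketch of the same idea, assuming Python 3: json.loads already turns the \u00c7 and \u00c3 escapes into real characters, so only the accent removal is left to do. The helper name strip_diacritics and the "status" key are made up for illustration. Instead of testing ord(i) < 128, this variant drops only combining marks via unicodedata.combining, so non-ASCII characters that have no decomposition are kept rather than silently removed.

```python
import json
import unicodedata


def strip_diacritics(text):
    """Remove combining marks from text (hypothetical helper, not from the answer)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# The JSON text contains literal \uXXXX escapes, as in the question's file.
raw = '{"status": "BAIXA DA INSCRI\\u00c7\\u00c3O ESTADUAL"}'
data = json.loads(raw)  # escapes are decoded into real characters here

plain = strip_diacritics(data['status'])
print(plain)  # BAIXA DA INSCRICAO ESTADUAL
```

When reading from a file, passing an explicit encoding (for example open(filepath, encoding='utf-8') in Python 3, or io.open in Python 2) avoids depending on the platform default.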

Upvotes: 1
