Reputation: 130
I have never really understood how encoding and decoding work in Python, and I run into this type of problem frequently. I have to read a JSON file and compare some of its values with other data.
In one of the files I have the string BAIXA DA INSCRI\u00c7\u00c3O ESTADUAL, which should become BAIXA DA INSCRICAO ESTADUAL. I am reading the file like this:
import codecs
import json
with codecs.open(filepath, 'r') as file:
    filedata = json.loads(file.read())
However, the string is read as unicode and represented as u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'.
How can I get the unaccented version, and what is the proper way to work with codecs in Python?
Upvotes: 0
Views: 259
Reputation: 148910
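It looks like you want to remove the diacritics from your text. You can convert it to Unicode normal form D (for decomposed), which splits each accented character into a base letter followed by combining marks, and then filter out every code point above 127: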
import unicodedata
txt = u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'
# Decompose, keep only the ASCII code points, then encode to a byte string
txt = u''.join(i for i in unicodedata.normalize('NFD', txt) if ord(i) < 128).encode('ASCII')
It should give the (byte) string:
'BAIXA DA INSCRICAO ESTADUAL'
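As for the reading part: the json module already turns the \u00c7 escape into the real character Ç when it parses the file, and u'...\xc7...' is only how the interpreter displays that character, so nothing is wrong with the decoded value. Below is a minimal sketch of how the two steps could be wrapped up. The helper names strip_diacritics and load_json are made up for illustration, and UTF-8 as the file encoding is an assumption about your files. Instead of the ord(i) < 128 test it drops only combining marks, which is a slightly different technique: characters with no decomposition (such as ø) are kept rather than silently removed.

```python
import io
import json
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits an accented character into base letter + combining mark(s)
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() is non-zero for combining marks; drop those
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def load_json(filepath, encoding='utf-8'):
    """Read a JSON file with an explicit encoding (UTF-8 is an assumption)."""
    with io.open(filepath, 'r', encoding=encoding) as f:
        return json.load(f)


print(strip_diacritics(u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'))
# prints: BAIXA DA INSCRICAO ESTADUAL
```

The result stays a unicode string, so it can be compared directly with other text; call .encode('ascii') on it only if you really need bytes.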
Upvotes: 1