Rodrigo Ney

Reputation: 373

Encoding - Problems creating JSON using a Unicode object in Python

I'm having real trouble encoding/decoding strings to a specific charset (UTF-8).

My Unicode Object is:

>> u'Valor Econ\xf4mico - Opini\xe3o'

When I print it from Python, I get:

>> Valor Econômico - Opinião

When I call .encode("utf-8") on my Unicode object to write it to JSON, it returns:

>> 'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o'

What am I doing wrong? What exactly is print() doing that I'm not?

Note: I'm creating this Unicode object from a line of a file.

import codecs
with codecs.open(path, 'r') as local_file:
    for line in local_file:
        obj = unicode((line.replace(codecs.BOM_UTF8, '')).replace('\n', ''), 'utf-8')

Upvotes: 2

Views: 279

Answers (1)

Alastair McCormack

Reputation: 27744

'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o' is the repr of the UTF-8 encoded byte string, probably as echoed by the interactive shell, which shows every non-ASCII byte as a \x escape. print, by contrast, writes the text itself to the terminal. The bytes are correct: if you were to write them to a file (open("myfile", "wb").write("Valor Econ\xc3\xb4mico - Opini\xc3\xa3o")) then you'd have a valid UTF-8 file.
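As a quick check, the sketch below encodes the string from the question and decodes it again. It only illustrates the point above; the variable names are made up for the example, and it is written to run on both Python 2 and Python 3.

```python
from __future__ import print_function

text = u'Valor Econ\xf4mico - Opini\xe3o'

# .encode() turns the Unicode text into a UTF-8 byte string.
encoded = text.encode('utf-8')

# repr() is what the interactive shell echoes: non-ASCII bytes appear as \x escapes.
print(repr(encoded))

# Decoding the bytes gives back exactly the original text, so nothing was lost.
decoded = encoded.decode('utf-8')
print(decoded == text)
```

The \xc3\xb4 and \xc3\xa3 pairs are simply the two-byte UTF-8 forms of ô and ã.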

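To create Unicode strings from a file, you can use automatic decoding in the io module (codecs.open() is being deprecated). Open the file with encoding="utf-8-sig" so that a leading BOM is removed automatically; plain "utf-8" decodes the BOM as u'\ufeff' and leaves it at the start of the first line:

import io
with io.open(path, "r", encoding="utf-8-sig") as local_file:
    for line in local_file:
        # line is already a Unicode string; strip() drops the trailing newline
        unicode_obj = line.strip()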
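One way to check how a leading BOM is handled is to write a small file that starts with one and read it back with two different codecs. This is only a sketch: the temporary file and the names with_sig and with_plain are assumptions for the example, not part of the question's code.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample file: a UTF-8 BOM followed by one line of text.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + u'Valor Econ\xf4mico - Opini\xe3o\n'.encode('utf-8'))

# "utf-8-sig" consumes the BOM while decoding.
with io.open(path, 'r', encoding='utf-8-sig') as local_file:
    with_sig = [line.strip() for line in local_file]

# Plain "utf-8" keeps the BOM as the character U+FEFF.
with io.open(path, 'r', encoding='utf-8') as local_file:
    with_plain = [line.strip() for line in local_file]

os.remove(path)
```

Here with_sig[0] is the clean text, while with_plain[0] still starts with u'\ufeff', because strip() does not treat that character as whitespace.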

When it comes to creating a JSON response, use the result of json.dumps(my_object) as is, without calling .encode() on the strings yourself. It returns a str in which every non-ASCII character is escaped as a \uXXXX Unicode code point.
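For illustration, here is a minimal sketch of that call with the string from the question. The dictionary and its "title" key are hypothetical.

```python
import json

# Hypothetical object holding the Unicode string from the question.
my_object = {u'title': u'Valor Econ\xf4mico - Opini\xe3o'}

# With the default ensure_ascii=True the result contains only ASCII characters.
payload = json.dumps(my_object)
print(payload)  # {"title": "Valor Econ\u00f4mico - Opini\u00e3o"}

# Any JSON parser turns the \uXXXX escapes back into the original characters.
round_trip = json.loads(payload)
```

The escaped form is valid JSON whatever charset the response is sent in, which is why no separate encoding step is needed.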

Upvotes: 1
