Reputation: 373
I'm having real trouble encoding/decoding strings to a specific charset (UTF-8).
My Unicode Object is:
>> u'Valor Econ\xf4mico - Opini\xe3o'
When I print it from Python, it displays:
>> Valor Econômico - Opinião
When I call .encode("utf-8") on my unicode object to write it to a JSON file, it returns:
>> 'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o'
What am I doing wrong? What exactly is print() doing that I'm not?
Note: I'm creating this unicode object from a line of a file.
import codecs
with codecs.open(path, 'r') as local_file:
    for line in local_file:
        obj = unicode((line.replace(codecs.BOM_UTF8, '')).replace('\n', ''), 'utf-8')
Upvotes: 2
Views: 279
Reputation: 27744
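'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o'
is the repr() of the UTF-8 encoded byte string, which is what the interactive shell echoes back: every non-ASCII byte is shown as a \x escape. The data itself is correct. print, by contrast, sends the text to the terminal, which renders the actual characters. If you were to write these bytes to a file (open("myfile", "wb").write("Valor Econ\xc3\xb4mico - Opini\xc3\xa3o")), you would have a valid UTF-8 file.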
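As a rough illustration of the point above, here is a minimal sketch. It is written in Python 3 syntax so it runs on a current interpreter, which is an assumption on my part since the question uses Python 2: there, str plays the role of Python 2's unicode and bytes the role of Python 2's str.

```python
# Python 3 sketch: str ~ Python 2 unicode, bytes ~ Python 2 str.
text = u'Valor Econ\xf4mico - Opini\xe3o'
encoded = text.encode('utf-8')

# repr() shows each non-ASCII byte as a \x escape; this is what the
# interactive shell echoes when you evaluate the expression.
shell_echo = repr(encoded)

# Decoding the bytes gives back the original text, so nothing was lost.
round_trip = encoded.decode('utf-8')

print(shell_echo)
print(round_trip == text)
```

The escaped form is only how the bytes are displayed; the two \xc3\xb4 bytes are the UTF-8 encoding of the single character ô.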
To create Unicode strings from a file, you can use automatic decoding in the io module (codecs.open() is being deprecated). With encoding="utf-8-sig", a leading BOM is removed automatically; plain "utf-8" would leave it in the first line as u'\ufeff':
import io
with io.open(path, "r", encoding="utf-8-sig") as local_file:
    for line in local_file:
        unicode_obj = line.strip()
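One detail worth checking when the file may start with a BOM: as far as I know, the plain "utf-8" codec decodes the BOM to U+FEFF and leaves it in the text, while "utf-8-sig" drops it. The sketch below writes a throwaway temporary file (the file and its contents are made up for the demonstration) and reads it back both ways.

```python
import codecs
import io
import os
import tempfile

text = u'Valor Econ\xf4mico - Opini\xe3o'

# Create a temporary UTF-8 file that starts with a BOM (demo data only).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + text.encode('utf-8') + b'\n')

# Plain utf-8: the BOM survives as U+FEFF, and strip() does not remove it.
with io.open(path, 'r', encoding='utf-8') as f:
    plain = [line.strip() for line in f]

# utf-8-sig: a leading BOM, if present, is skipped while decoding.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    sig = [line.strip() for line in f]

os.remove(path)
print(plain[0].startswith(u'\ufeff'), sig[0] == text)
```

"utf-8-sig" also reads files without a BOM correctly, so it is a reasonable default when the source of the file is unknown.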
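When it comes to creating a JSON response, pass the unicode object directly to json.dumps(my_object) and use the result; there is no need to call .encode() first. By default it returns a str in which every non-ASCII character is escaped as a \uXXXX sequence, so the output is pure ASCII.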
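A small sketch of that behaviour, again in Python 3 syntax. The dictionary and its "title" key are hypothetical stand-ins for my_object.

```python
import json

# Hypothetical payload; "title" is just an illustrative key.
data = {'title': u'Valor Econ\xf4mico - Opini\xe3o'}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as-is; the caller must then
# encode the result (for example as UTF-8) before writing it out.
unescaped = json.dumps(data, ensure_ascii=False)

# Both forms parse back to the same object.
print(escaped)
print(json.loads(escaped) == json.loads(unescaped) == data)
```

Either output is valid JSON. The escaped form is the safer choice when you do not control the encoding of the channel it travels over.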
Upvotes: 1