Reputation: 280
I created a file containing a dictionary with data written in Spanish (e.g. "Damián"):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some FQL fetching operations, e.g.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n". I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte).
To sum up: I've tried several possibilities, but I have less than a clue. I'm lost. Please, I need help. Thanks!
Upvotes: 3
Views: 1342
Reputation: 799110
Use codecs.open() with a specific encoding, e.g. encoding='utf-8', to open fileNameX instead of using the built-in open(). Also, use json.dump() to write directly to the file instead of json.dumps().
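For instance, a minimal sketch along those lines, assuming Python 2.x and a stand-in dictionaryX and file name (both placeholders):

import codecs
import json

dictionaryX = [{'name': u'Damián'}]  # stand-in for the real FQL results

# codecs.open() returns a file object that encodes Unicode for you.
with codecs.open('output.json', 'w', encoding='utf-8') as f:
    # json.dump() (no 's') writes straight to the file object;
    # ensure_ascii=False keeps "á" literal instead of "\u00e1".
    json.dump(dictionaryX, f, indent=4, ensure_ascii=False)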
Upvotes: 2
Reputation: 308402
Since the string has a \u escape inside it, that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should display it properly for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián
Upvotes: 0
Reputation: 61615
You have many options, and you have stumbled upon something rather complicated that depends on your Python version and that you absolutely must understand fully in order to write correct code. Generally, the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte string (of unknown encoding). In 3.x, it still produces a str, but a str in 3.x is a proper Unicode string.
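A quick check of that difference, assuming 3.x (under 2.x the same call returns a byte-oriented str instead):

import json

s = json.dumps({'name': u'Damián'})
print(type(s))  # <class 'str'> -- a real Unicode string in 3.x
print(s)        # {"name": "Dami\u00e1n"} -- escaped by default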
JSON is an inherently Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u-style escapes in strings. When you read this data back in, you will get the correct string: the reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
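For example, a round trip through json.loads shows the escape is only a representation on disk:

import json

raw = '{"name": "Dami\\u00e1n"}'  # the escaped form seen in the file
data = json.loads(raw)
print(data['name'])  # Damián -- decoded back to the original character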
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
The character á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False

This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
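To illustrate, assuming 3.x (under 2.x the result here would be a unicode object rather than a str):

import json

s = json.dumps({'name': u'Damián'}, ensure_ascii=False)
print(s)  # {"name": "Damián"} -- the á survives unescaped, but s still
          # has to be encoded before it can be written to a file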
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
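A sketch of both routes, using a placeholder file name:

import json

s = json.dumps({'name': u'Damián'}, ensure_ascii=False)

# 2.x: encode explicitly, because file.write() would otherwise try to
# encode the Unicode string as ASCII and fail:
#     open('output.json', 'w').write(s.encode('utf-8'))

# 3.x: choose the encoding when opening the file instead.
with open('output.json', 'w', encoding='utf-8') as f:
    f.write(s)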
encode(encoding='latin-1')

Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
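For instance, under 3.x:

import json

try:
    json.dumps({'name': b'Dami\xe1n'})  # a latin-1 encoded byte string
except TypeError as exc:
    print(exc)  # e.g. "Object of type bytes is not JSON serializable"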
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; that is the special case. The ordinary case, in fact, is writing directly to a file (instead of giving you a string and expecting you to write it yourself), which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
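Putting it together: the same idea as the codecs.open() sketch above, this time written for 3.x with the built-in open() (the dictionary is again a stand-in):

import json

dictionaryX = [{'name': u'Damián'}]  # placeholder data

with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(dictionaryX, f, indent=4, ensure_ascii=False)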
Upvotes: 2