Reputation: 280
I created a file containing a dictionary with data written in Spanish (e.g. "Damián"):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some FQL fetching operations, e.g.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n". I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte).
To sum up: I've tried several possibilities, but I have less than a clue. I'm lost. Please, I need help. Thanks!
Upvotes: 3
Views: 1342
Reputation: 799110
Use codecs.open() with a specific encoding, e.g. encoding='utf-8', to open fileNameX instead of using the built-in open(). Also, use json.dump() to write directly to the file instead of json.dumps().
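For instance, a minimal sketch along those lines, assuming Python 2.x and a stand-in dictionaryX and file name (both placeholders):

import codecs
import json

dictionaryX = [{'name': u'Damián'}]  # stand-in for the real FQL results

# codecs.open() returns a file object that encodes Unicode for you.
with codecs.open('output.json', 'w', encoding='utf-8') as f:
    # json.dump() (no 's') writes straight to the file object;
    # ensure_ascii=False keeps "á" literal instead of "\u00e1".
    json.dump(dictionaryX, f, indent=4, ensure_ascii=False)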
Upvotes: 2
Reputation: 308402
Since the string has a \u escape inside it, that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should display it properly for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián
Upvotes: 0
Reputation: 61615
You have many options, and you have stumbled upon something rather complicated that depends on your Python version and that you absolutely must understand fully in order to write correct code. Generally, the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte string (of unknown encoding). In 3.x, it still produces a str, but a str in 3.x is a proper Unicode string.
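A quick check of that difference, assuming 3.x (under 2.x the same call returns a byte-oriented str instead):

import json

s = json.dumps({'name': u'Damián'})
print(type(s))  # <class 'str'> -- a real Unicode string in 3.x
print(s)        # {"name": "Dami\u00e1n"} -- escaped by default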
JSON is an inherently Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u-style escapes in strings. When you read this data back in, you will get the correct string: the reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
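For example, a round trip through json.loads shows the escape is only a representation on disk:

import json

raw = '{"name": "Dami\\u00e1n"}'  # the escaped form seen in the file
data = json.loads(raw)
print(data['name'])  # Damián -- decoded back to the original character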
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
The character á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False

This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
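To illustrate, assuming 3.x (under 2.x the result here would be a unicode object rather than a str):

import json

s = json.dumps({'name': u'Damián'}, ensure_ascii=False)
print(s)  # {"name": "Damián"} -- the á survives unescaped, but s still
          # has to be encoded before it can be written to a file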
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
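A sketch of both routes, using a placeholder file name:

import json

s = json.dumps({'name': u'Damián'}, ensure_ascii=False)

# 2.x: encode explicitly, because file.write() would otherwise try to
# encode the Unicode string as ASCII and fail:
#     open('output.json', 'w').write(s.encode('utf-8'))

# 3.x: choose the encoding when opening the file instead.
with open('output.json', 'w', encoding='utf-8') as f:
    f.write(s)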
encode(encoding='latin-1')

Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
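For instance, under 3.x:

import json

try:
    json.dumps({'name': b'Dami\xe1n'})  # a latin-1 encoded byte string
except TypeError as exc:
    print(exc)  # e.g. "Object of type bytes is not JSON serializable"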
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; that is the special case. The ordinary case, in fact, is writing directly to a file (instead of giving you a string and expecting you to write it yourself), which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
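Putting it together: the same idea as the codecs.open() sketch above, this time written for 3.x with the built-in open() (the dictionary is again a stand-in):

import json

dictionaryX = [{'name': u'Damián'}]  # placeholder data

with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(dictionaryX, f, indent=4, ensure_ascii=False)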
Upvotes: 2