Meloun

Reputation: 15119

json and encode strings

I have this list of strings:

mylist = [u"čeština", u"maďarština", u"francouština"]

I need to dump it into a file, and I am using JSON for that:

text = json.dumps(mylist)
FILE = open("file.txt", 'w')
FILE.write(text)
FILE.close()

But when I open the file in an editor (set to UTF-8), I see:

["\u010de\u0161tina", "ma\u010far\u0161tina", "francou\u0161tina"]

When I read the list back from the file, I get the right values. But the file should also be human-readable, so I expect:

["čeština", "maďarština", "francouština"]

or at least

[u"čeština", u"maďarština", u"francouština"]

Upvotes: 1

Views: 178

Answers (2)

Andrew Clark

Reputation: 208715

When you do json.dumps([u"čeština", u"maďarština", u"francouština"]) you will get the string '["\\u010de\\u0161tina", "ma\\u010far\\u0161tina", "francou\\u0161tina"]' (written as a valid Python string literal). The \u escapes are one way to represent Unicode characters in JSON, and Python's json module converts all non-ASCII characters to \u escapes by default. You can disable this behavior by passing ensure_ascii=False to json.dumps().

Here are a few examples, first the default behavior:

>>> json.dumps(lst)
'["\\u010de\\u0161tina", "ma\\u010far\\u0161tina", "francou\\u0161tina"]'
>>> print json.dumps(lst)
["\u010de\u0161tina", "ma\u010far\u0161tina", "francou\u0161tina"]

And with ensure_ascii=False:

>>> json.dumps(lst, ensure_ascii=False)
u'["\u010de\u0161tina", "ma\u010far\u0161tina", "francou\u0161tina"]'
>>> print json.dumps(lst, ensure_ascii=False)
["čeština", "maďarština", "francouština"]

Now, to make sure this Unicode string is written as UTF-8, you can use the codecs module:

import codecs, json
lst = [u"čeština", u"maďarština", u"francouština"]
# The with block closes the file, so the data is flushed to disk.
with codecs.open('file.txt', 'w', 'utf-8') as f:
    json.dump(lst, f, ensure_ascii=False)

Note that I also used json.dump(), which writes directly to a file, instead of json.dumps().
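
For readers on Python 3, the same idea can be sketched without codecs, since the built-in open() accepts an encoding argument there. This is a minimal sketch under that assumption, not part of the original answer; the temporary path and the names raw and restored are made up for the illustration.

```python
import json
import os
import tempfile

lst = ["čeština", "maďarština", "francouština"]

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Write the list as JSON, keeping non-ASCII characters as-is and
# encoding the file as UTF-8.
with open(path, "w", encoding="utf-8") as f:
    json.dump(lst, f, ensure_ascii=False)

# Read the raw text back: this is what a UTF-8 editor would display.
with open(path, encoding="utf-8") as f:
    raw = f.read()

# Parsing the text gives back the original list.
restored = json.loads(raw)
print(raw)
```

The file contents stay readable in an editor, and loading them returns the same list that was written.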

Upvotes: 6

Explosion Pills

Reputation: 191819

u"čeština" is not valid JSON (the u prefix is Python syntax). As far as I know, you can't have multibyte characters in JSON strings either (i.e. it is also invalid), but I can't back that up.

["\u010de\u0161tina"] is valid JSON. When it is parsed, the Unicode characters are decoded from the \u escapes. For some peace of mind, open your browser's console, type "\u010de\u0161tina", hit Enter, and see the string that is printed.
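
The same check can be done in Python instead of a browser console. The sketch below assumes Python 3 and its standard json module; the variable names are only illustrative. It parses the escaped text and, for comparison, the same text with the characters written out directly, which Python's parser also accepts.

```python
import json

# JSON text using \u escapes (backslashes doubled for the Python literal).
escaped = '["\\u010de\\u0161tina"]'
# The same JSON text with the characters written directly.
literal = '["čeština"]'

a = json.loads(escaped)
b = json.loads(literal)

# Both parse to the same one-element list.
same = (a == b)
print(a, b, same)
```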

Upvotes: 4
