Reputation: 15119
I have this list of strings:
mylist = [u"čeština", u"maďarština", u"francouština"]
I need to dump it into a file, and I am using JSON for that:
text = json.dumps(mylist)
FILE = open("file.txt", 'w')
FILE.write(text)
FILE.close()
But when I open the file in an editor (with UTF-8 encoding) I see
["\u010de\u0161tina", "ma\u010far\u0161tina", "francou\u0161tina"]
When I read the list back from the file, I get the right values. But the file should also be human-readable, so I expect
["čeština", "maďarština", "francouština"]
or at least
[u"čeština", u"maďarština", u"francouština"]
Upvotes: 1
Views: 178
Reputation: 208715
When you do json.dumps([u"čeština", u"maďarština", u"francouština"]) you will get the string '["\\u010de\\u0161tina", "ma\\u010far\\u0161tina", "francou\\u0161tina"]' (using valid Python string literal form). The \u escapes are how JSON represents non-ASCII characters in ASCII-only output, and Python's json module converts all non-ASCII characters to \u escapes by default. You can disable this behavior by passing ensure_ascii=False in the json.dumps() arguments.
Here are a few examples, first the default behavior:
>>> json.dumps(lst)
'["\\u010de\\u0161tina", "ma\\u010far\\u0161tina", "francou\\u0161tina"]'
>>> print json.dumps(lst)
["\u010de\u0161tina", "ma\u010far\u0161tina", "francou\u0161tina"]
And with ensure_ascii=False:
>>> json.dumps(lst, ensure_ascii=False)
u'["\u010de\u0161tina", "ma\u010far\u0161tina", "francou\u0161tina"]'
>>> print json.dumps(lst, ensure_ascii=False)
["čeština", "maďarština", "francouština"]
Now, to make sure this Unicode string is written to the file as UTF-8, you can use the codecs module:
import codecs, json
lst = [u"čeština", u"maďarština", u"francouština"]
# The with block closes the file, so the output is flushed to disk.
with codecs.open('file.txt', 'w', 'utf-8') as f:
    json.dump(lst, f, ensure_ascii=False)
Note that I also used json.dump(), which writes directly to a file, instead of json.dumps().
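For readers on Python 3, the same idea can be sketched without codecs. This is only an illustration under stated assumptions: it assumes Python 3, where the built-in open() accepts an encoding argument, and the temporary path and the names words, path and raw are made up for the example rather than taken from the question.

```python
import json
import os
import tempfile

# Python 3 assumption: plain string literals are already Unicode.
words = ["čeština", "maďarština", "francouština"]

# Hypothetical location: a temporary directory, so no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# encoding="utf-8" makes the file object encode the text on write;
# ensure_ascii=False stops json from emitting \uXXXX escapes.
with open(path, "w", encoding="utf-8") as f:
    json.dump(words, f, ensure_ascii=False)

# Read the file back as text to see exactly what an editor would show.
with open(path, encoding="utf-8") as f:
    raw = f.read()

print(raw)

# The readable form still round-trips to the original list.
assert json.loads(raw) == words
```

Passing the encoding explicitly avoids depending on the platform's default encoding, which may not be UTF-8.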
Upvotes: 6
Reputation: 191819
u"čeština" is not valid JSON. As far as I know, you can't have multibyte characters in JSON strings either (i.e. it is also invalid), but I can't back that up.
["\u010de\u0161tina"] is valid JSON. When it is parsed, the Unicode characters are decoded from the \u escapes. For some peace of mind, open your browser's console, type "\u010de\u0161tina", hit Enter, and see the string that is printed.
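The same check can be done from Python instead of a browser console. The sketch below assumes Python 3 and uses made-up variable names; it parses the escaped form and, for comparison, a form with the raw characters, which Python's json module also appears to accept.

```python
import json

# Raw string literal: the backslashes reach the JSON parser untouched,
# so this is ASCII-only JSON text containing \u escapes.
escaped = r'["\u010de\u0161tina"]'

# The same data written with the non-ASCII characters themselves.
literal = '["čeština"]'

decoded_escaped = json.loads(escaped)
decoded_literal = json.loads(literal)

# Both parse to the same one-element list.
print(decoded_escaped == decoded_literal)
```

So, at least as far as Python's parser is concerned, the two spellings are interchangeable and the choice only affects how the file looks in an editor.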
Upvotes: 4