Reputation: 6372
I'm crawling a website in Python. The website text is in iso-8859-1. Once I read the HTML, I convert the text to utf-8 like so:
import urllib

pageHTML = urllib.urlopen(url).read()
pageHTML = pageHTML.decode('iso-8859-1').encode('utf8')
I do some processing on the HTML, and store some tokens in an array. Then I dump the arrays as json into a file as so:
import json

with open(outputPath, 'w') as f:
    json.dump(tokens, f)
However, when I open the dumped file in a text editor I don't see the actual characters; instead I see escaped Unicode code points like the following:
"hei\u00dfen"
which should be displayed as "heißen".
My questions:
1. Why is that?
2. How do I solve it?
The text editor is Atom, but I also tried TextEdit on OS X.
Upvotes: 2
Views: 145
Reputation: 532003
The default for json.dump is to assume ASCII output, meaning any non-ASCII Unicode character is represented using the \uxxxx notation. To change that, set the ensure_ascii option to False. Some examples using dumps:
>>> print json.dumps("foö")
"fo\u00f6"
>>> print json.dumps("foö", ensure_ascii=False)
"foö"
Upvotes: 3