Jack Twain

Reputation: 6372

Why does the text editor show some characters as Unicode code points?

I'm crawling a website in Python. The website's text is encoded in iso-8859-1. Once I've read the HTML, I convert the text to UTF-8 like so:

pageHTML = urllib.urlopen( url ).read()
pageHTML = pageHTML.decode('iso-8859-1').encode('utf8')

I do some processing on the HTML and store some tokens in an array. Then I dump the array as JSON to a file like so:

with open(outputPath, 'w') as f:
    json.dump(tokens, f)

However, when I open the dumped file in a text editor, I don't see the actual characters. Instead, I see Unicode escape sequences such as:

"hei\u00dfen"

which should be displayed as "heißen".

My questions:

1. Why is that?

2. How do I solve it?

The text editor is Atom, but I also tried TextEdit on OS X.

Upvotes: 2

Views: 145

Answers (1)

chepner

Reputation: 532003

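By default, json.dump produces ASCII-only output (ensure_ascii=True), so every non-ASCII Unicode character is written as a \uxxxx escape sequence. The file is still valid JSON and decodes back to the same string; your editor is simply showing the escapes that were literally written. To get the actual characters in the file, set the ensure_ascii option to False. Some examples using dumps: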

>>> print json.dumps("foö")
"fo\u00f6"
>>> print json.dumps("foö", ensure_ascii=False)
"foö"
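
The same option applies when writing straight to a file with json.dump. Below is a minimal sketch, assuming Python 3 (where open accepts an encoding argument) rather than the Python 2 used in the question; the tokens list and the temporary output path are made up for illustration:

```python
import json
import os
import tempfile

# Hypothetical tokens standing in for the ones extracted from the HTML.
tokens = ["heißen", "foö"]

# Hypothetical output location; any writable path would do.
output_path = os.path.join(tempfile.mkdtemp(), "tokens.json")

# Open the file as UTF-8 text and tell json.dump not to escape non-ASCII.
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(tokens, f, ensure_ascii=False)

# Read the raw text back: it holds the real characters, not \uXXXX escapes.
with open(output_path, encoding="utf-8") as f:
    raw_text = f.read()

# For comparison, the default (escaped) form decodes to the same data.
escaped = json.dumps(tokens)
same_data = json.loads(escaped) == json.loads(raw_text)
```

Either form loads back to identical strings, so the choice only affects how readable the file is in an editor.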

Upvotes: 3
