timpone

Reputation: 19969

python noob question about codecs and utf-8

I'm using Python to pick up some pieces, so this is definitely a noob question, but I didn't see a satisfactory answer.

I have a UTF-8 JSON file with some pieces that contain grave accents, acute accents, etc. I'm using codecs and have (for example):

import codecs
import json

# Renamed from "str" so the built-in type is not shadowed.
f = codecs.open('../../publish_scripts/locations.json', 'r', 'utf-8')
locations = json.load(f)

for location in locations:
    print location['name']

Does anything special need to be done for printing? It's giving me the following:
'ascii' codec can't encode character u'\xe9' in position 5

It looks like the correct utf-8 value for e-acute. I suspect I'm doing something wrong when printing. Would the iteration cause it to lose its utf-8'ness?

PHP and Ruby versions handle the utf-8 piece fine; is there some looseness in those languages that python won't do?

thx

Upvotes: 1

Views: 287

Answers (4)

Tobu

Reputation: 25426

The standard I/O streams are broken for non-ASCII character I/O in Python 2 and with some site.py setups. Basically, you need to call sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in Ubuntu, you need to imp.reload(sys) first to make sys.setdefaultencoding available. Alternatively, you can wrap sys.stdout (and stdin and stderr) in Unicode-aware readers/writers, which you can get from codecs.getreader / codecs.getwriter.
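A minimal sketch of the wrapping approach, assuming UTF-8 is the encoding you want on output. An in-memory io.BytesIO buffer stands in for sys.stdout so the result can be inspected, and the helper name make_utf8_writer is hypothetical, not part of any library:

```python
import codecs
import io


def make_utf8_writer(byte_stream):
    """Wrap a byte stream so that unicode text written to it is UTF-8 encoded."""
    return codecs.getwriter('utf-8')(byte_stream)


# Stand-in for sys.stdout; in Python 2 the equivalent would be roughly
#   sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
buf = io.BytesIO()
writer = make_utf8_writer(buf)

# Writing unicode text no longer goes through the ascii codec.
writer.write(u'caf\xe9\n')
encoded = buf.getvalue()
```

The writer encodes on every write, so the calling code can keep passing unicode objects around and never call .encode() itself.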

Upvotes: 0

Rob Cowie

Reputation: 22619

codecs.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a Python unicode object (which behaves similarly to a string object).

Printing a unicode object will cause an implicit (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present, it will fail.

To print it, you should first encode it, thus:

for location in locations:
    print location['name'].encode('utf8')
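To illustrate the difference between the implicit ascii encode and an explicit UTF-8 encode, here is a small sketch. The sample value is made up for illustration and is not taken from the asker's file:

```python
# Hypothetical sample value containing e-acute (U+00E9).
name = u'Caf\xe9'

# Encoding to ascii fails, which is what the implicit encode runs into.
error = None
try:
    name.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# Encoding explicitly to UTF-8 succeeds and yields two bytes for the accent.
utf8_bytes = name.encode('utf-8')
```

The exception's start attribute holds the index of the offending character, which corresponds to the "position" reported in the error message.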

EDIT:

For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.

By default json.load() expects the file to be UTF-8 encoded, so your code snippet can be simplified:

import json

locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
    print location['name'].encode('utf8')
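A self-contained sketch of the same load-then-encode flow, assuming a throwaway temporary file in place of locations.json (the sample record is hypothetical). It uses io.open with an explicit encoding so that it behaves the same on Python 2 and Python 3:

```python
import io
import json
import os
import tempfile

# Hypothetical sample data standing in for the real locations.json.
sample = u'[{"name": "Gen\xe8ve"}]'

fd, path = tempfile.mkstemp(suffix='.json')
try:
    # Write the JSON document to disk as UTF-8 bytes.
    with os.fdopen(fd, 'wb') as out:
        out.write(sample.encode('utf-8'))

    # Decode on read; json.load then returns unicode text for each string.
    with io.open(path, 'r', encoding='utf-8') as handle:
        locations = json.load(handle)
finally:
    os.remove(path)

names = [location['name'] for location in locations]

# Encode only at the output boundary, as in the answer above.
encoded_names = [name.encode('utf-8') for name in names]
```

Iterating over the loaded list does not change the strings; they stay unicode text until something encodes them.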

Upvotes: 3

Martin Vilcans

Reputation: 5718

You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.

Try this instead:

print location['name'].encode('utf-8')

If your terminal is set to expect output in utf-8 format, this will print correctly.
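If you would rather not hard-code utf-8, one possible approach is to use the encoding the stream declares and fall back to UTF-8 when it declares none (for example when output is piped). This is only a sketch; encode_for_stream and FakeStream are hypothetical names used for illustration:

```python
def encode_for_stream(text, stream, fallback='utf-8'):
    """Encode text with the stream's declared encoding, else the fallback."""
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the encoding cannot represent.
    return text.encode(encoding, 'replace')


class FakeStream(object):
    """Stand-in for sys.stdout with a configurable encoding attribute."""

    def __init__(self, encoding):
        self.encoding = encoding


piped = encode_for_stream(u'caf\xe9', FakeStream(None))
ascii_only = encode_for_stream(u'caf\xe9', FakeStream('ascii'))
```

In a real script you would pass sys.stdout as the stream and write the returned bytes out.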

Upvotes: 2

It's the same as in PHP: UTF-8 encoded strings are fine to print.

Upvotes: 0
