gorodechnyj

Reputation: 691

Django shell encoding error (Debian only, Ubuntu fine)

Good day

Can somebody explain what is going on behind the Django manage.py shell console? The problem is as follows. I'm developing a Django app that uses urllib to fetch some HTML pages and parse information out of them. That information is in Russian, so it should be Unicode (an address string in this case). My script then feeds it to a third-party module, which fails because it is not a valid Unicode string (I'm trying to geocode a point from the address). I tried to print the string (the parsed address) to the console with print address, but it fails:

File "<console>", line 1, in <module>
... some useless stacktrace ...    
    print address
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Now comes the interesting part.
I have 2 computers: a workstation with Ubuntu and Python 2.7.2, and a Debian Lenny VPS with Python 2.7.2. I start the parser the same way on both machines: by executing python manage.py shell and calling my function from it.
At first I got the same error on both installations, but then I noticed that my Python default encoding is set to 'ascii' (import sys; sys.getdefaultencoding()). And when I put

import sys; reload(sys).setdefaultencoding('utf-8')

into settings.py, the problem is solved on Ubuntu. Now I get proper output there, e.g. г. Челябинск, ул. Кирова, д. 27, КТК Набережный, but this does not work on Debian.

If I delete the print address line, then I get unreadable geolocation errors, but again only on Debian. Ubuntu works just fine:

Failed to geodecode address [г. ЧелÑбинÑк, Ñл. 1-ой ÐÑÑилеÑки, 17/1, ÑÑнок ÐÑÑак, 1-з]

No amount of unicode(address).encode('utf-8') magic helps.

So I just don't get it. What are the differences between the machines that cause me so much trouble?

Upvotes: 2

Views: 1335

Answers (1)

ed.

Reputation: 1393

If you run the following python script, you'll see what's happening:

# -*- coding: utf-8 -*-
a = r"Челябинск"
print "Encode from UTF-8 to UTF-8:",unicode(a,'utf-8').encode('utf-8')
print "Encode from ISO8859-1 to UTF-8:",unicode(a,'iso8859-1').encode('utf-8')

The output is:

Encode from UTF-8 to UTF-8: Челябинск

Encode from ISO8859-1 to UTF-8: ЧелÑбинÑк

In essence, you're taking a string that is already encoded as UTF-8, decoding it as if it were ISO8859-1, and then encoding the result as UTF-8 a second time.
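As a rough illustration of the double encoding described above, here is a minimal sketch that garbles the city name the same way and then reverses the damage. The variable names are hypothetical, and it assumes the garbling happened exactly once through ISO8859-1; if some other codec was involved, the repair step will not recover the text.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.

original = u"Челябинск"                      # text (unicode), 9 characters
utf8_bytes = original.encode("utf-8")        # correct UTF-8 bytes, 18 of them

# The mistake: treating those UTF-8 bytes as ISO8859-1 text.
garbled = utf8_bytes.decode("iso8859-1")     # 18 characters of mojibake
double_encoded = garbled.encode("utf-8")     # 36 bytes, encoded twice

# The repair (assumes a single wrong ISO8859-1 decode):
# undo the wrong decode, then decode the bytes as UTF-8.
repaired = garbled.encode("iso8859-1").decode("utf-8")

# Print only ASCII so this works even on an ASCII-only console.
print("lengths: %d %d %d %d" % (
    len(original), len(utf8_bytes), len(garbled), len(double_encoded)))
print("repaired == original: %s" % (repaired == original))
```

The byte count doubling from 18 to 36 is the telltale sign: every byte of the original UTF-8 data is above 0x7F, so each one turns into a two-byte sequence on the second encode.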

It's worth checking what the default encoding of the machine is in each case.
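To compare the two machines, something like the following could be run inside python manage.py shell on each of them. This is only a sketch: describe_encodings is a hypothetical helper name, and the guess that the Debian VPS runs under a C/POSIX locale (which would leave the console encoding at ASCII) is an assumption, not something confirmed in the question.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses in different places."""
    return {
        # Used for implicit str <-> unicode conversion in Python 2.
        "default": sys.getdefaultencoding(),
        # Used by print for unicode objects; may be None when piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Derived from the locale settings (LANG, LC_ALL, ...).
        "locale": locale.getpreferredencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-12s %s" % (name, value))
```

If the stdout or locale rows differ between Ubuntu and Debian, the locale configuration of the VPS is a likely place to look before changing anything in settings.py.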

If anyone can add to this answer then please do.

Upvotes: 3
