Alexandros Marinos

Reputation: 1396

decoding problem with urllib2 in python

I'm trying to use urllib2 in Python 2.7 to fetch a page from the web. The page happens to be encoded in Unicode (UTF-8) and contains Greek characters. When I try to fetch and print it with the code below, I get gibberish instead of the Greek characters.

import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()

The result is the same both in Netbeans 6.9.1 and in Windows 7 CLI.

I'm doing something wrong, but what?

Upvotes: 0

Views: 1834

Answers (2)

knitti

Reputation: 7033

  1. Unicode is not UTF-8. UTF-8 is a string encoding, like ISO-8859-1, ASCII etc.

  2. Always decode your data as soon as possible, to make real Unicode out of it: 'some string in utf-8'.decode('utf-8') == u'some string in utf-8'. Unicode objects are written u'', not ''.

  3. When you have data leaving your app, always encode it in the proper encoding. For web stuff this is mostly UTF-8. For console stuff this is whatever your console encoding is. On Windows this is not UTF-8 by default.
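
The decode-early, encode-late rule from points 2 and 3 can be sketched roughly as follows. This is only an illustration: `recode` is a hypothetical helper name, and the hard-coded bytes stand in for what `urllib2.urlopen(...).read()` would return (they are the UTF-8 encoding of the Greek word "Ελληνικά"). The target encoding is passed in explicitly and would be your console's encoding in practice.

```python
# -*- coding: utf-8 -*-

def recode(raw_bytes, source_encoding, target_encoding):
    """Hypothetical helper: decode on the way in, encode on the way out."""
    # Step 1: turn the raw bytes into a real Unicode object.
    text = raw_bytes.decode(source_encoding)
    # Step 2: encode for the destination; characters the target encoding
    # cannot represent are replaced with '?' instead of raising an error.
    return text.encode(target_encoding, 'replace')

# Stand-in for the bytes read from the network (assumed to be UTF-8).
raw = b'\xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'

text = raw.decode('utf-8')                 # 8 Unicode characters
as_utf8 = recode(raw, 'utf-8', 'utf-8')    # identical to the input bytes
as_ascii = recode(raw, 'utf-8', 'ascii')   # every Greek letter becomes '?'
```

If the console encoding cannot represent Greek, the 'replace' handler gives question marks rather than a UnicodeEncodeError, which at least makes the mismatch visible.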

Upvotes: 3

Steve Tjoa

Reputation: 61024

It prints correctly for me, too.

Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, you can find "Set Character Encoding" and make sure it is UTF-8.
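
To see which encoding Python itself believes the output goes to, something like the following may help. It is a small sketch, not a definitive check: `console_encoding` is a made-up helper name, and the fallback to the locale's preferred encoding is an assumption for the case where output is piped or redirected and `sys.stdout.encoding` is not set.

```python
import locale
import sys

def console_encoding():
    """Hypothetical helper: best guess at the encoding used for output."""
    # sys.stdout.encoding may be None when output is piped or redirected,
    # so fall back to the locale's preferred encoding in that case.
    return getattr(sys.stdout, 'encoding', None) or locale.getpreferredencoding()

detected = console_encoding()
```

If the detected value is not UTF-8, printing the raw UTF-8 bytes from the page will show up as gibberish unless they are decoded and re-encoded for that console first.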

Upvotes: 1
