PepperoniPizza

Reputation: 9112

Unicode characters not showing properly

I crawled a set of sites and extracted various strings containing escaped Unicode characters, such as 'Best places to eat in D\xfcsseldorf'. I have them stored as shown in a PostgreSQL database. When I retrieve one of these strings from the database and do:

name = string_retrieved_from_database
print name

it outputs u'Best places to eat in D\xfcsseldorf'. I want to display the string as it should be: 'Best places to eat in Düsseldorf'. How can I do that?

Upvotes: 0

Views: 1053

Answers (2)

BrenBarn

Reputation: 251398

Are you sure you get output when you print the variable, instead of just displaying it interactively? You should never get the u'...' display when using print:

>>> x = b"Best places to eat in D\xfcsseldorf"
>>> x.decode('latin-1')
u'Best places to eat in D\xfcsseldorf'
>>> print x.decode('latin-1')
Best places to eat in Düsseldorf

If you're getting the backslash and so forth in the actual string, then it's possible something went wrong at the encoding stage (e.g., literal backslashes were written into the text). In that case you may want to look at the "unicode-escape" codec:

>>> x = b"Best places to eat in D\\xfcsseldorf"
>>> print x
Best places to eat in D\xfcsseldorf
>>> print x.decode('unicode-escape')
Best places to eat in Düsseldorf

Upvotes: 3

Ned Batchelder

Reputation: 375604

You need to deal with the encodings as early as possible. The best approach is to read the HTML page, decode the byte strings you get into Unicode, and then store the strings as Unicode in the database, or at least in a uniform encoding like UTF-8.
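As a rough illustration of decoding at the boundary, here is a minimal Python 3 sketch. The helper decode_page and its declared_encoding parameter are hypothetical names made up for this example, and the fallback order (declared encoding, then UTF-8, then Latin-1) is an assumption, not a rule that fits every site.

```python
def decode_page(raw_bytes, declared_encoding=None):
    """Hypothetical helper: turn fetched bytes into str once, up front."""
    for encoding in (declared_encoding, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Latin-1 maps every byte to a character, so this last resort never raises.
    return raw_bytes.decode("latin-1")


# Simulated page content that arrived as Latin-1 bytes.
page = "Best places to eat in D\u00fcsseldorf".encode("latin-1")
title = decode_page(page, declared_encoding="latin-1")

# From here on, work with str; encode to one uniform encoding only when storing.
stored = title.encode("utf-8")
print(title)
```

The point of the design is that every string past decode_page is Unicode, so the rest of the crawler never has to guess an encoding again.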

If you need help with the details, "Pragmatic Unicode, or, How Do I Stop The Pain" has them all.

Upvotes: 3
