Alexander Egurnov

Reputation: 177

Working with Unicode in Python

I connect to a MySQL database using pymysql, and after executing a query I got the following string: \xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0.

This should be 5 characters in UTF-8, but when I do print s.encode('utf-8') I get this: ╨╝╨░╤А╨║╨░. The string looks like the byte representation of Unicode characters, which Python fails to recognize.

So what do I do to make Python process them properly?

Upvotes: 2

Views: 4168

Answers (2)

Mark Byers

Reputation: 837926

You want to decode (not encode) to get a unicode string from a byte string.

>>> s = '\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'
>>> us = s.decode('utf-8')
>>> print us
марка

Note that print may fail or show garbage if your console's encoding cannot represent these characters, because they are outside ASCII. You should still be able to see the value in a Unicode-aware debugger. I ran the above in IDLE.
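
To illustrate the console issue, here is a small Python 3 sketch. It decodes the same ten bytes twice: once as UTF-8, and once as code page 866. That the asker's console uses cp866 is an assumption; it is simply a code page under which these bytes render as the box-drawing text quoted in the question.

```python
# Python 3 sketch; the cp866 console code page is an assumption.
raw = b'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'  # the bytes from the question

as_utf8 = raw.decode('utf-8')    # intended reading: 5 Cyrillic letters
as_cp866 = raw.decode('cp866')   # one glyph per byte: 10 characters

print(len(as_utf8), len(as_cp866))  # 5 10
```

The data is the same in both cases; only the interpretation of the bytes differs.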

Update

It seems what you actually have is this:

>>> s = u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'

This is trickier because you first have to get those code points back into a byte string before you can call decode. I'm not sure what the "best" way to do that is, but this works:

>>> us = ''.join(chr(ord(c)) for c in s).decode('utf-8')
>>> print us
марка
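
Another way to do the same conversion, sketched here in Python 3 syntax, is to encode with latin-1 first. Latin-1 maps each code point below 256 to the byte with the same value, so it recovers the original bytes without the chr/ord loop. The variable names are only illustrative.

```python
# Python 3 sketch: every character of s carries one byte value (0-255).
s = '\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'

raw = s.encode('latin-1')   # code point N becomes byte N, for N < 256
us = raw.decode('utf-8')    # now read those bytes as UTF-8

print(len(s), len(us))      # 10 5
```

The same expression, s.encode('latin-1').decode('utf-8'), should also work on the u'...' string in Python 2. It raises UnicodeEncodeError if the string contains any code point above 255, which would mean the data is not this kind of mis-decoded text.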

Note that you should, of course, decode the data before you store it in the database as a string.

Upvotes: 5

Ned Batchelder

Reputation: 375484

Mark is right: you need to decode the string. Byte strings become Unicode strings by decoding them; encoding goes the other way. This and many other details are covered in Pragmatic Unicode, or, How Do I Stop The Pain?
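
As a minimal round-trip sketch of the two directions (Python 3 syntax, where str is the Unicode type and bytes is the byte string type; the variable names are only illustrative):

```python
# Python 3 sketch of the encode/decode round trip.
text = '\u043c\u0430\u0440\u043a\u0430'   # the five-letter word from the question

data = text.encode('utf-8')      # Unicode string -> bytes
restored = data.decode('utf-8')  # bytes -> Unicode string

print(len(text), len(data), restored == text)  # 5 10 True
```

Each of the five letters takes two bytes in UTF-8, which is why the byte string is ten bytes long.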

Upvotes: 4
