Understanding encoding and decoding in Python

Question

I'm trying to understand how encoding works in Python 2.7, and there are some aspects of it I can't quite grasp. I've worked with files in different encodings and had been doing okay so far, until I started working with a certain API that requires Unicode strings

u'text'

and I was using Normal strings

'text'

Which caused a lot of problems.

So I want to know how to go from a Unicode string to a normal string and back, because the data I'm working with is handled as normal strings, and the only way I know to get Unicode strings without issues is in the Python shell.

What I've tried is:

>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'

Now, to get a Unicode string, what I do is:

>>> foobar = unicode(foo, "latin1")
>>> foobar
u'gur\xa3'

But this doesn't work for me, since I'm doing some comparisons in my code like this:

>>> foobar in u"Foo gurú Bar"
False

Which fails, even if the original value is the same, because of the encoding.

[Edit]

I'm using Python Shell on Windows 10.

S. Tyr · Accepted Answer

The Windows terminal uses legacy DOS code pages. On US Windows it is:

>>> import sys
>>> sys.stdout.encoding
'cp437'

Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Your results may vary!
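To see why the code page matters, here is a small sketch that decodes the same bytes under several encodings. The byte string is the one from the question; the choice of code pages to compare is my own, and the snippet is written so it should run on both Python 2.7 and Python 3.

```python
# Bytes for "guru" with an accented u, as a cp850 console delivers them.
raw = b'gur\xa3'

# Decode the same bytes under several code pages and keep the results.
decoded = {}
for codec in ('cp437', 'cp850', 'cp1252', 'latin1'):
    decoded[codec] = raw.decode(codec)

# The DOS code pages map byte 0xA3 to U+00FA (u with acute accent),
# while cp1252 and latin1 map it to U+00A3 (the pound sign).
same_as_literal = decoded['cp850'] == u'gur\xfa'
latin1_differs = decoded['latin1'] != u'gur\xfa'
print(same_as_literal, latin1_differs)
```

This is why decoding with "latin1" in the question produced a string that no longer matched the Unicode literal.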

So, if you want to go from a normal string to Unicode and back, you first have to find out the encoding your console uses, since that determines the bytes of normal strings typed in the Python 2.x shell. Then use that encoding to make the proper conversion.

I leave you with an example:

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
>>>
>>> foobar = unicode(foo, 'cp850')
>>> foobar
u'gur\xfa'
>>>
>>> foobar in u"Foo gurú Bar"
True
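If you need this in a script rather than in the shell, the same idea could be wrapped up roughly as follows. This is only a sketch: `console_encoding`, `to_unicode` and `to_bytes` are hypothetical helper names of mine, and falling back to UTF-8 when `sys.stdout.encoding` is unavailable is an assumption, not something Python does for you. It uses `.decode()` and `.encode()` instead of `unicode()` so that it should also run on Python 3.

```python
import sys


def console_encoding(default='utf-8'):
    """Return the encoding of sys.stdout, or `default` when it is unknown."""
    return getattr(sys.stdout, 'encoding', None) or default


def to_unicode(raw, encoding=None):
    """Decode a byte string into a Unicode string (hypothetical helper)."""
    return raw.decode(encoding or console_encoding())


def to_bytes(text, encoding=None):
    """Encode a Unicode string back into a byte string (hypothetical helper)."""
    return text.encode(encoding or console_encoding())


# Bytes as a cp850 console would deliver them. The code page is passed
# explicitly here so the result does not depend on the current machine.
foo = b'gur\xa3'
foobar = to_unicode(foo, 'cp850')

# The comparison from the question now succeeds.
found = foobar in u'Foo gur\xfa Bar'

# Going back to a normal (byte) string recovers the original bytes.
round_trip = to_bytes(foobar, 'cp850')
```

Passing the encoding explicitly is the safer option when the data comes from a file or an API rather than from the console, because `sys.stdout.encoding` only describes the terminal.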
