Unfamiliar encoding in Python

Question

I am trying to write a binary converter in Python, but some byte values print as characters I did not expect:

>>> print '\x97'
—
>>> print '\x96'
–
>>> print '\x94'
”
>>> print '\x95'
•

What is that encoding called?

John Machin · Accepted Answer

That encoding could be ANY of the nine Windows single-byte "ANSI" encodings, cp1250 to cp1258 inclusive:

>>> guff = "\x97\x96\x94\x95"
>>> uguff0 = guff.decode('1250')
>>> all(guff.decode(str(e)) == uguff0 for e in xrange(1251, 1259))
True
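
The session above is Python 2. A minimal sketch of the same check, assuming Python 3 (where the data must be a bytes literal and `xrange` becomes `range`), might look like this; the names `decoded` and `all_same` are only illustrative:

```python
# The four bytes from the question, as a Python 3 bytes literal.
guff = b"\x97\x96\x94\x95"

# Decode the same bytes with each Windows code page from cp1250 to cp1258.
decoded = {cp: guff.decode("cp%d" % cp) for cp in range(1250, 1259)}

# If every code page maps these bytes to the same characters,
# the set of decoded strings has exactly one member.
all_same = len(set(decoded.values())) == 1
print(all_same)
```

This only shows that the four bytes in the question cannot tell the nine code pages apart; it says nothing about other byte values.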

Usage:

1250: Central/Eastern Europe languages with Latin-based alphabets e.g. Polish, Czech, Slovak, Hungarian
1251: Cyrillic alphabet e.g. Russian
1252: Western European languages with Latin-based alphabets
1253 to 1258: Greek, Turkish, Hebrew, Arabic, Baltic languages, and Vietnamese, respectively.
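
The code pages agree on the punctuation bytes in the question but differ on many bytes in the upper half of the range. As a rough illustration, assuming Python 3 and a sample byte (`0xE0`) that is not from the question, the following sketch prints what each code page makes of it:

```python
# Hypothetical sample byte chosen for illustration; it is not in the question.
sample = b"\xe0"

distinct = set()
for cp in range(1250, 1259):
    # errors="replace" guards against a byte that a code page leaves undefined.
    char = sample.decode("cp%d" % cp, errors="replace")
    distinct.add(char)
    print(cp, ascii(char))

# More than one distinct result means the code pages disagree on this byte.
print(len(distinct) > 1)
```

Text containing such bytes is what lets you narrow down which code page actually produced the data.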

To find out what is in use on your computer:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
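
The value shown is from one particular machine; yours may differ. A small sketch, assuming Python 3, that checks whether the preferred encoding is one of the nine Windows code pages discussed here (the variable names are illustrative):

```python
import codecs
import locale

# The encoding the platform prefers for text; depends on OS and user settings.
preferred = locale.getpreferredencoding()

# codecs.lookup normalizes aliases to Python's canonical codec name.
canonical = codecs.lookup(preferred).name

windows_code_pages = {"cp%d" % n for n in range(1250, 1259)}
is_windows_ansi = canonical in windows_code_pages
print(preferred, canonical, is_windows_ansi)
```

On a typical Linux or macOS system this will likely report a UTF-8 locale rather than a Windows code page.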

Here's what the codes mean:

>>> from unicodedata import name
>>> for c in uguff0:
...     print repr(c), name(c)
...
u'\u2014' EM DASH
u'\u2013' EN DASH
u'\u201d' RIGHT DOUBLE QUOTATION MARK
u'\u2022' BULLET
>>>
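
For readers on Python 3, a sketch of the same lookup, assuming cp1252 as the code page (any of cp1250 to cp1258 gives the same result for these four bytes):

```python
from unicodedata import name

# Decode the bytes from the question; cp1252 is an assumed choice here.
text = b"\x97\x96\x94\x95".decode("cp1252")

# Collect the official Unicode name of each decoded character.
names = [name(c) for c in text]

for c, n in zip(text, names):
    # ascii() shows the escaped code point instead of the glyph itself.
    print(ascii(c), n)
```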
