aadah

Reputation: 81

Unfamiliar encoding in Python

I am trying to create a binary converter with Python, but I encounter some strange codes:

>>> print '\x97'
—
>>> print '\x96'
–
>>> print '\x94'
”
>>> print '\x95'
•

What is that encoding called?

Upvotes: 0

Views: 563

Answers (2)

John Machin

Reputation: 82992

That encoding could be ANY of the nine Windows single-byte "ANSI" encodings, cp1250 to cp1258 inclusive:

>>> guff = "\x97\x96\x94\x95"
>>> uguff0 = guff.decode('1250')
>>> all(guff.decode(str(e)) == uguff0 for e in xrange(1251, 1259))
True

Usage:

1250: Central/Eastern Europe languages with Latin-based alphabets e.g. Polish, Czech, Slovak, Hungarian
1251: Cyrillic alphabet e.g. Russian
1252: Western European languages with Latin-based alphabets
The others (1253 to 1258) cover Greek, Turkish, Hebrew, Arabic, the Baltic languages, and Vietnamese.
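For readers on Python 3, here is a rough equivalent of the check above. It is only a sketch: it assumes the standard codec names 'cp1250' through 'cp1258', and the names guff, codecs_to_try, decoded, all_agree and probe are just illustrative. It also decodes one extra byte (0xE9, picked arbitrarily) to show that the code pages stop agreeing once you leave the shared punctuation range.

```python
# Python 3 sketch: decode the four bytes from the question under each
# Windows code page from cp1250 to cp1258 and compare the results.
import unicodedata

guff = b"\x97\x96\x94\x95"
codecs_to_try = ["cp%d" % n for n in range(1250, 1259)]

# Map each codec name to the text it produces for the same four bytes.
decoded = {codec: guff.decode(codec) for codec in codecs_to_try}

# If every codec produced the same string, the set has exactly one member.
all_agree = len(set(decoded.values())) == 1
print(all_agree)  # True

# An arbitrary byte from the letter range, where the code pages diverge.
probe = b"\xe9"
for codec in ("cp1251", "cp1252"):
    ch = probe.decode(codec)
    print(codec, repr(ch), unicodedata.name(ch))
```

So the four bytes in the question cannot tell you which of the nine code pages produced them; a byte like 0xE9 can at least narrow it down.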

To find out what is in use on your computer:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

Here's what the codes mean:

>>> from unicodedata import name
>>> for c in uguff0:
...     print repr(c), name(c)
...
u'\u2014' EM DASH
u'\u2013' EN DASH
u'\u201d' RIGHT DOUBLE QUOTATION MARK
u'\u2022' BULLET
>>>

Upvotes: 2

paxdiablo

Reputation: 882078

That is hex escape notation rather than an encoding in itself. '\x97' means: take the hex value 97, which is 151 in decimal, and put the byte with that value inside the string.

Character 151 is the em-dash, 150 is the en-dash, 148 is the closing double quote, and 149 is the bullet point. Keep in mind that these numbers are positions in a Windows code page, not Unicode code points.
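To make the distinction concrete, here is a small Python 3 sketch. It assumes cp1252 as the code page (any of cp1250 to cp1258 would give the same result for this byte), and the variable names are made up for illustration. It shows that the escape is just the number 151, and that the character you get depends on which codec interprets that byte.

```python
# Python 3 sketch: '\x97' is only a way of writing the byte value 151.
byte_value = 0x97
print(byte_value)  # 151

raw = bytes([byte_value])
print(raw == b"\x97")  # True

# Interpreted as cp1252 (an assumption), byte 151 is the em dash, U+2014.
text = raw.decode("cp1252")
print(hex(ord(text)))  # 0x2014

# latin-1 maps every byte to the code point with the same number, so the
# same byte becomes U+0097, a control character rather than a dash.
as_latin1 = raw.decode("latin-1")
print(hex(ord(as_latin1)))  # 0x97
```

The byte value 151 and the Unicode code point U+2014 are different numbers; the code page is what links them.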

Upvotes: 1
