Reputation: 81
I am trying to write a binary converter in Python, but I keep running into some strange codes:
>>> print '\x97'
—
>>> print '\x96'
–
>>> print '\x94'
”
>>> print '\x95'
•
What is that encoding called?
Upvotes: 0
Views: 563
Reputation: 82992
That encoding could be ANY of the nine Windows single-byte "ANSI" encodings, cp1250 to cp1258 inclusive:
>>> guff = "\x97\x96\x94\x95"
>>> uguff0 = guff.decode('1250')
>>> all(guff.decode(str(e)) == uguff0 for e in xrange(1251, 1259))
True
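For readers on Python 3, the same check can be sketched roughly as follows. This is an adaptation rather than part of the original session: it assumes the escapes are written as a bytes literal (a Python 3 `str` has no `.decode`), and the names `raw`, `codecs_to_try`, `decoded` and `reference` are made up for illustration.

```python
# Python 3 sketch: the escapes must live in a bytes literal to be decoded.
raw = b"\x97\x96\x94\x95"

# The nine Windows "ANSI" code pages, cp1250 through cp1258.
codecs_to_try = ["cp%d" % n for n in range(1250, 1259)]

# Decode the same four bytes with every code page.
decoded = {codec: raw.decode(codec) for codec in codecs_to_try}

# Compare every result against the cp1250 result.
reference = decoded["cp1250"]
all_agree = all(text == reference for text in decoded.values())

print(all_agree)
print(ascii(reference))  # ascii() avoids console encoding problems
```

If all nine code pages agree, the bytes alone cannot tell you which one produced them; you need other context, such as the machine's locale.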
Usage:
1250: Central/Eastern Europe languages with Latin-based alphabets e.g. Polish, Czech, Slovak, Hungarian
1251: Cyrillic alphabet e.g. Russian
1252: Western European languages with Latin-based alphabets
The others are cp1253 (Greek), cp1254 (Turkish), cp1255 (Hebrew), cp1256 (Arabic), cp1257 (Baltic languages), and cp1258 (Vietnamese).
To find out what is in use on your computer:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
Here's what the codes mean:
>>> from unicodedata import name
>>> for c in uguff0:
...     print repr(c), name(c)
...
u'\u2014' EM DASH
u'\u2013' EN DASH
u'\u201d' RIGHT DOUBLE QUOTATION MARK
u'\u2022' BULLET
>>>
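A rough Python 3 equivalent of that lookup, pairing each raw byte with the character it decodes to, might look like this. It assumes cp1252 as the code page (any of cp1250 to cp1258 should give the same result for these four bytes), and `raw`, `text` and `rows` are illustrative names only.

```python
import unicodedata

# Python 3 sketch: decode the four bytes, then look up each character's name.
raw = b"\x97\x96\x94\x95"
text = raw.decode("cp1252")  # assumption: cp1252; see the note above

# Iterating over bytes yields ints in Python 3, so hex() works directly.
rows = [
    (hex(byte), "U+%04X" % ord(char), unicodedata.name(char))
    for byte, char in zip(raw, text)
]

for row in rows:
    print(*row)
```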
Upvotes: 2
Reputation: 882078
That would be a hex escape sequence. It means: take the hex value 97, which is 151 in decimal, and put the character with that code into the string.
Character 151 is the em dash, 150 is the en dash, 148 is the closing double quote, and 149 is the bullet point, as shown here. Keep in mind that these values are not Unicode code points (as stated there) but positions in a Windows code page.
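To make the distinction between a code page position and a Unicode code point concrete, here is a small Python 3 sketch. It assumes cp1252 as the Windows code page and uses latin-1 only because that codec maps each byte to the code point with the same number; the variable names are illustrative.

```python
import unicodedata

# The escape \x97 is just the number 0x97, i.e. 151 in decimal.
value = int("97", 16)
byte = bytes([value])

# Read as a Windows code page position (assumption: cp1252), it is the em dash.
as_cp1252 = byte.decode("cp1252")

# Read as Unicode code point U+0097 (latin-1 maps bytes 1:1 to code points),
# it is an unprintable C1 control character instead.
as_code_point = byte.decode("latin-1")

print(value)
print(ascii(as_cp1252), unicodedata.name(as_cp1252))
print(ascii(as_code_point), unicodedata.category(as_code_point))
```

The same byte therefore means two different things depending on whether it is treated as a code page position or as a code point.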
Upvotes: 1