Reputation: 1869
I want to convert a unicode input to a \x-escaped, 7-bit-ASCII-clean representation of a UTF-8 byte sequence.
This is analogous to what I need, but instead of "\u222a" I would like to generate "\xe2\x88\xaa":
>>> codecs.encode(u"\u222A", 'ascii', 'backslashreplace')
'\\u222a'
This looks like it is generating the desired result:
>>> u"\u222A".encode('utf-8')
'\xe2\x88\xaa'
But that is merely the interpreter's escaped representation. The actual result isn't 12 ASCII bytes, it's 3 UTF-8 bytes:
>>> [ord(c) for c in u"\u222A".encode('utf-8')]
[226, 136, 170]
I could abuse that escaped representation to get what I want, stripping off the leading and trailing quotes that repr adds:
>>> repr(u"\u222A".encode('utf-8'))[1:-1]
'\\xe2\\x88\\xaa'
>>> [ord(c) for c in repr(u"\u222A".encode('utf-8'))[1:-1]]
[92, 120, 101, 50, 92, 120, 56, 56, 92, 120, 97, 97]
Yuck. This is a little better:
>>> import binascii
>>> ''.join('\\x' + binascii.hexlify(c) for c in u"\u222A".encode('utf-8'))
'\\xe2\\x88\\xaa'
>>> [ord(c) for c in ''.join('\\x' + binascii.hexlify(c) for c in u"\u222A".encode('utf-8'))]
[92, 120, 101, 50, 92, 120, 56, 56, 92, 120, 97, 97]
Is there a better way to do this?
Upvotes: 2
Views: 4637
Reputation: 798814
>>> u'\u222A'.encode('utf-8').encode('string-escape')
'\\xe2\\x88\\xaa'
>>> print u'\u222A'.encode('utf-8').encode('string-escape')
\xe2\x88\xaa
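Note that the `string-escape` codec exists only in Python 2. As a minimal sketch that should behave the same on Python 2.7 and Python 3, each UTF-8 byte can be formatted explicitly. The helper name `escape_utf8_bytes` is made up for illustration, and unlike `string-escape` it escapes every byte, including printable ASCII.

```python
def escape_utf8_bytes(text):
    """Return text's UTF-8 encoding as an ASCII string of \\xNN escapes."""
    # bytearray yields integers on both Python 2 and Python 3.
    data = bytearray(text.encode('utf-8'))
    return ''.join('\\x{:02x}'.format(byte) for byte in data)


escaped = escape_utf8_bytes(u'\u222A')
print(escaped)       # \xe2\x88\xaa
print(len(escaped))  # 12
```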
Upvotes: 2
Reputation: 308276
I don't think you'll find a solution that isn't ugly. Here's one that retains any ASCII characters that are in the original string without converting them to a hex sequence.
''.join(c if 32 <= ord(c) < 127 else '\\x{:02x}'.format(ord(c)) for c in u"\u222A".encode('utf-8'))
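One caveat with keeping ASCII as-is: a literal backslash in the input becomes indistinguishable from the start of an escape sequence. Here is a hedged sketch, assuming Python 2.7 or 3, that escapes the backslash as well and treats DEL (127) as non-printable. The name `escape_non_ascii` is hypothetical.

```python
def escape_non_ascii(text):
    """Keep printable ASCII as-is; write every other UTF-8 byte as \\xNN."""
    out = []
    # bytearray yields integers on both Python 2 and Python 3.
    for byte in bytearray(text.encode('utf-8')):
        if byte == 0x5c:
            out.append('\\\\')  # double the backslash so the output stays unambiguous
        elif 32 <= byte < 127:
            out.append(chr(byte))  # printable ASCII passes through unchanged
        else:
            out.append('\\x{:02x}'.format(byte))
    return ''.join(out)


print(escape_non_ascii(u'A \u222A B'))  # A \xe2\x88\xaa B
```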
Upvotes: 0