Reputation: 24768
I have this:
>>> su = u'"/\"'
In Python, how can I convert this to a representation that shows the Unicode code points? For the string above, that would be:
u'\u0022\u002F\u005C\u0022'
Upvotes: 0
Views: 743
Reputation: 414335
To support the full Unicode range, you could use the unicode-escape codec to get the text representation. To represent characters in the ASCII range as Unicode escapes too, and to force the \u00xx representation even for u'\xff', you could use a regex:
#!/usr/bin/env python2
import re
su = u'"/"\U000af600'
assert u'\ud800' not in su # no lone surrogate
print re.sub(ur'[\x00-\xff]', lambda m: u"\ud800u%04x" % ord(m.group()), su,
             flags=re.U).encode('unicode-escape').replace('\\ud800', '\\')
A lone surrogate (U+D800) is used as a temporary placeholder to avoid escaping the backslash twice. Output:
\u0022\u002f\u0022\U000af600
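For comparison, here is a minimal sketch of the same idea without a regex, assuming Python 3, where every str element is a full code point. The name escape_all is hypothetical and not part of any library; it emits \uXXXX for code points up to U+FFFF and \UXXXXXXXX above that.

```python
def escape_all(text):
    """Return text with every character written as a Unicode escape.

    Hypothetical helper; assumes Python 3, where ord() sees full code points.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # Four hex digits are enough inside the Basic Multilingual Plane.
            parts.append('\\u%04x' % cp)
        else:
            # Code points above U+FFFF need the eight-digit \U form.
            parts.append('\\U%08x' % cp)
    return ''.join(parts)


print(escape_all('"/"\U000af600'))
# \u0022\u002f\u0022\U000af600
```

No placeholder surrogate is needed here because the backslashes are written directly and never pass through the unicode-escape codec.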
Upvotes: 1
Reputation: 177735
Your original string is not four characters but three, because \" is an escape code for a double quote:
>>> su = u'"/\"'
>>> len(su)
3
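Listing the code points makes the missing backslash visible. This is a small sketch written for Python 3 (no u prefix needed); the variable name codes is only for illustration.

```python
su = '"/\"'  # \" is an escape sequence for a plain double quote

# One entry per character, so a backslash would show up as U+005C.
codes = ['U+%04X' % ord(c) for c in su]
print(len(su), codes)
# 3 ['U+0022', 'U+002F', 'U+0022']
```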
Here's how to display it as escape codes:
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u0022'
Use a raw Unicode string, or double the backslash to escape it, to get four characters:
>>> su = ur'"/\"' # Raw version
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u005C\\u0022'
>>> su = u'"/\\"' # Escaped version
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u005C\\u0022'
Note the double backslashes in the result. Each pair represents a single literal backslash. With only one backslash, they would be real escape codes, and the literal would be no different from the four-character string itself:
>>> ur'"/\"' == u'\u0022\u002F\u005C\u0022'
True
Printing it shows the content of the strings:
>>> print u'\u0022\u002F\u005C\u0022'
"/\"
>>> print(''.join(u'\\u{:04X}'.format(ord(c)) for c in su))
\u0022\u002F\u005C\u0022
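As a sanity check, the escaped text can be decoded back to the original string. The sketch below assumes Python 3 and characters within the Basic Multilingual Plane; to_escapes and from_escapes are hypothetical helper names.

```python
def to_escapes(text):
    # Assumes every character is at or below U+FFFF; higher code points
    # would need the eight-digit \U form instead.
    return ''.join('\\u{:04X}'.format(ord(c)) for c in text)


def from_escapes(escaped):
    # The escaped text is pure ASCII, so encode it to bytes and let the
    # unicode_escape codec interpret the \uXXXX sequences.
    return escaped.encode('ascii').decode('unicode_escape')


su = r'"/\"'  # raw string: four characters, including the backslash
escaped = to_escapes(su)
print(escaped)
# \u0022\u002F\u005C\u0022
print(from_escapes(escaped) == su)
# True
```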
Upvotes: 5