converting Unicode code point numbers to Unicode characters

Question

I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain "ordinary" Unicode characters (extended Latin, etc.), but sometimes--particularly when the characters belong to a right-to-left script--it's easier to encode the strings as Unicode code points, like \u0644. But argparse treats these designators as a sequence of characters, and does not convert them into the character they designate. For instance, if a command line parameter is

... -a "abc\u06d2d" ...

then what I get in the argparse variable is

"abc\u06d2d"

rather than the expected

"abcےd"

(the character between the 'c' and 'd' is the yeh barree, U+06D2). Of course both outcomes are logical; it's just that the second one is the one I want.

I tried to reproduce this in an interpreter, but under most circumstances Python 3 automagically converts a string literal like "abc\u06d2d" into "abcےd". Not so when I read the string using argparse...
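
The difference can be reproduced without a shell. The sketch below assumes that a raw string is a fair stand-in for what the command line delivers, and it passes an explicit argument list to parse_args() instead of reading sys.argv:

```python
import argparse

# In source code, the parser turns the escape into one character
# before the program ever runs.
literal = "abc\u06d2d"
print(len(literal))  # 5

# A command-line argument is plain text: a backslash, a 'u', and hex digits.
# The raw string below stands in for what the shell would pass along.
parser = argparse.ArgumentParser()
parser.add_argument("-a")
args = parser.parse_args(["-a", r"abc\u06d2d"])

print(len(args.a))        # 10, the escape is left untouched
print(args.a == literal)  # False
```

So the conversion seen in the interpreter is a property of string literals, not something argparse skips.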

I came up with a function to do the conversion, see below. But I feel like I'm missing something much simpler. Is there an easier way to do this conversion? (Obviously I could use str.startswith(), or regexes to match the entire thing, rather than going character by character, but the code below is really just an illustration. It seems like I shouldn't have to create my own function to do this at all, especially since in some circumstances it seems to happen automatically.)

---------My code to do this follows---------

import string

def ParseString2Unicode(sInString):
  r"""Return a version of sInString in which any Unicode code points of the form
        \uXXXX (X = hex digit)
     have been converted into their corresponding Unicode characters.
     Example:
         "\u0064b\u0065"
     becomes
         "dbe"
  """
  # The docstring is a raw string so that \uXXXX is not parsed as an escape.
  sOutString = ""
  while sInString:
      # string.hexdigits covers both upper- and lower-case hex digits.
      if len(sInString) >= 6 and \
         sInString[0] == "\\" and \
         sInString[1] == "u" and \
         sInString[2] in string.hexdigits and \
         sInString[3] in string.hexdigits and \
         sInString[4] in string.hexdigits and \
         sInString[5] in string.hexdigits:
          #If we get here, the first 6 characters of sInString represent
          # a Unicode code point, like "\u0065"; convert it into a char:
          sOutString += chr(int(sInString[2:6], 16))
          sInString = sInString[6:]
      else:
          #Strip a single char:
          sOutString += sInString[0]
          sInString = sInString[1:]
  return sOutString

Artyer · Accepted Answer

What you may want to look at is the raw_unicode_escape encoding.

>>> len(b'\uffff')
6
>>> b'\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\uffff'.decode('raw_unicode_escape'))
1

So, the function would be:

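def ParseString2Unicode(sInString):
    try:
        # Encode with the same codec used for decoding, so that non-ASCII
        # characters already in the string survive the round trip.
        encoded = sInString.encode('raw_unicode_escape')
        return encoded.decode('raw_unicode_escape')
    except UnicodeError:
        # Malformed escapes (e.g. a truncated \u06d) are left as-is.
        return sInString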
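
One detail worth checking is the encoding used for the intermediate bytes. raw_unicode_escape treats every byte that is not part of an escape as Latin-1, so going through UTF-8 can mangle non-ASCII characters that are already in the string. A small comparison (the variable names are just for illustration):

```python
text = r"café \u0644"  # real 'é' plus a literal backslash-u escape

# UTF-8 stores 'é' as two bytes, which then get read back as two
# Latin-1 characters.
via_utf8 = text.encode("utf-8").decode("raw_unicode_escape")

# raw_unicode_escape stores 'é' as one Latin-1 byte, so it round-trips.
via_raw = text.encode("raw_unicode_escape").decode("raw_unicode_escape")

print(via_raw == "café \u0644")  # True
print(via_utf8 == via_raw)       # False
```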

The raw_unicode_escape codec, however, also decodes other Unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:

import re

escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')

def _escape_sequence_to_char(match):
    return chr(int(match[0][2:], 16))

def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
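
Since the strings come from argparse, one option is to pass a converter as the type= argument of add_argument(), so the value is already unescaped when it reaches the rest of the program. A minimal sketch, assuming a single -a option; ESCAPE_RE and unescape_arg are hypothetical names, not part of argparse:

```python
import argparse
import re

# Literal backslash, 'u', then exactly four hex digits.
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_arg(value):
    """Hypothetical converter: replace each \\uXXXX with its character."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), value)

parser = argparse.ArgumentParser()
# argparse calls the type= callable on the raw argument string.
parser.add_argument("-a", type=unescape_arg)

# An explicit list stands in for sys.argv[1:] here.
args = parser.parse_args(["-a", r"abc\u06d2d"])
print(args.a == "abc\u06d2d")  # True
```

Strings without any escapes pass through unchanged, so the same option still accepts ordinary Unicode text.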
