How do I display the Unicode escape syntax (\uXXXX) for a character in a string?

I tried approaches such as encoding with "unicode-display" and using a raw string, but neither worked. I am writing a module for a chat bot in Python that takes a character from the user and shows it in the "\uXXXX" format instead of rendering it as the character itself. Here's my code:

import discord
from discord.ext import commands
import unicodedata as ud

class Unicode:
    """Encode Unicode characters!"""

    def __init__(self, bot):
        self.bot = bot

    @commands.command()
    async def unicode(self, *, character):
        """Encode a Unicode character."""
        try:
            data = ud.normalize('NFC', character)
        except ValueError:
            data = '<unknown>'
        await self.bot.say(data)

def setup(bot):
    bot.add_cog(Unicode(bot))

Upvotes: 0

Views: 2590

Answers (1)

Martijn Pieters

Reputation: 1123032

If all you need is the Unicode code point, get the ord() value and express that as a hex value:

'U+{:04X}'.format(ord(data[0]))

This will use at least 4 hex digits (uppercased) for a given character, and more if the character is outside the Basic Multilingual Plane (BMP). I picked the widely accepted U+hhhh format rather than the Python / JSON / JavaScript escape sequence format here.

Demo:

>>> data = '⛄'
>>> unicode_codepoint = 'U+{:04X}'.format(ord(data[0]))
>>> print(unicode_codepoint)
U+26C4
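
The expression above only looks at data[0]. If the user may send more than one character, the same formatting can be applied to each character in turn. A minimal sketch, assuming a helper named codepoints (a hypothetical name, not part of the question's code):

```python
def codepoints(text):
    """Return the U+hhhh notation for every character in text."""
    # ord() gives the code point; {:04X} pads to at least 4 uppercase hex digits.
    return ['U+{:04X}'.format(ord(ch)) for ch in text]


print(codepoints('⛄'))   # ['U+26C4']
print(codepoints('a🖖'))  # ['U+0061', 'U+1F596']
```

In the bot command you could then join the list with spaces before sending it.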

You could also encode the data to a JSON string or use the ascii() function to create a string (with quotes) with \u escape sequences:

>>> import json
>>> print(json.dumps(data))
"\u26c4"
>>> print(ascii(data))
'\u26c4'

This has the downside that you now have to remove those quote characters again (for example with str.strip(), passing the quote character you want removed).
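
As a sketch of the quote removal, slicing off the first and last character is an alternative to str.strip() that should also behave when the input itself is a quote character. The helper names json_escape and ascii_escape are hypothetical, chosen only for this example:

```python
import json


def json_escape(text):
    """JSON-style escapes for text, without the surrounding double quotes."""
    # json.dumps() escapes non-ASCII characters by default (ensure_ascii=True).
    return json.dumps(text)[1:-1]


def ascii_escape(text):
    """Python-style escapes for text, without the surrounding quotes."""
    # ascii() always wraps the result in exactly one pair of quotes.
    return ascii(text)[1:-1]


print(json_escape('⛄'))   # \u26c4
print(ascii_escape('⛄'))  # \u26c4
```

Slicing removes exactly one character from each end, whereas str.strip() removes every matching character at both ends, which matters if the user sends a quote.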

The difference between the two approaches is that encoding to JSON produces UTF-16 surrogate pairs for characters outside of the BMP, whereas ascii() gives you \Uhhhhhhhh Python escape codes:

>>> data = '🖖'
>>> print('U+{:04X}'.format(ord(data[0])))
U+1F596
>>> print(json.dumps(data))
"\ud83d\udd96"
>>> print(ascii(data))
'\U0001f596'
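
If you want every character escaped, including plain ASCII letters that json.dumps() and ascii() leave untouched, you can build the escape sequences yourself. The following is a minimal sketch under that assumption; python_escape and utf16_escape are hypothetical helper names, not functions from any library:

```python
def python_escape(text):
    """Escape every character as \\uhhhh, or \\Uhhhhhhhh outside the BMP."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Outside the BMP: Python uses the 8-digit form.
            parts.append('\\U{:08x}'.format(cp))
        else:
            parts.append('\\u{:04x}'.format(cp))
    return ''.join(parts)


def utf16_escape(text):
    """Escape every UTF-16 code unit as \\uhhhh (JSON / JavaScript style)."""
    raw = text.encode('utf-16-be')
    # Each UTF-16 code unit is two bytes; non-BMP characters yield two units.
    units = (int.from_bytes(raw[i:i + 2], 'big') for i in range(0, len(raw), 2))
    return ''.join('\\u{:04x}'.format(unit) for unit in units)


print(python_escape('🖖'))  # \U0001f596
print(utf16_escape('🖖'))   # \ud83d\udd96
```

Which of the two fits depends on where the output will be pasted: the first matches Python string literals, the second matches JSON and JavaScript.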

Upvotes: 2
