Reputation: 3
I tried approaches such as encoding with "unicode-display" and using a raw string, but neither worked. I am writing a module for a chat bot in Python that takes a character from the user and should show it in the "\uXXXX" format instead of rendering it as the character itself. Here's my code:
import discord
from discord.ext import commands
import unicodedata as ud

class Unicode:
    """Encode Unicode characters!"""

    def __init__(self, bot):
        self.bot = bot

    @commands.command()
    async def unicode(self, *, character):
        """Encode a Unicode character."""
        try:
            data = ud.normalize('NFC', character)
        except ValueError:
            data = '<unknown>'
        await self.bot.say(data)

def setup(bot):
    bot.add_cog(Unicode(bot))
Upvotes: 0
Views: 2590
Reputation: 1123032
If all you need is the Unicode code point, get the ord() value and express that as a hex value:
'U+{:04X}'.format(ord(data[0]))
This uses at least 4 uppercase hex digits for a given character, and more if the character is outside the Basic Multilingual Plane (BMP). I picked the widely accepted U+hhhh format rather than the Python / JSON / JavaScript escape sequence format here.
Demo:
>>> data = '⛄'
>>> unicode_codepoint = 'U+{:04X}'.format(ord(data[0]))
>>> print(unicode_codepoint)
U+26C4
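Since user input may contain more than one character, the same formatting can be applied to every code point in the string. This is only a minimal sketch; the helper name codepoints is made up for illustration and is not part of the original code:

```python
def codepoints(text):
    """Return the U+hhhh notation for every character in text.

    Hypothetical helper: applies the ord() + hex formatting shown above
    to each code point rather than only the first one.
    """
    return ['U+{:04X}'.format(ord(ch)) for ch in text]


print(codepoints('⛄'))    # ['U+26C4']
print(codepoints('a🖖'))   # ['U+0061', 'U+1F596']
```

Joining the list with spaces gives a single string that can be sent back to the chat.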
You could also encode the data to a JSON string or use the ascii() function to create a string (with quotes) with \u escape sequences:
>>> import json
>>> print(json.dumps(data))
"\u26c4"
>>> print(ascii(data))
'\u26c4'
This has the downside that you now have to remove those quote characters again (use str.strip(), passing it the quote character to remove).
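As a rough sketch of removing the quotes, assuming the input is a single ordinary character such as the snowman from the demo: str.strip() needs the quote character as its argument, and slicing off the first and last character is an alternative that never touches quotes belonging to the data itself. The variable names here are illustrative only:

```python
import json

data = '⛄'

# json.dumps() always wraps its result in double quotes.
json_escaped = json.dumps(data).strip('"')

# ascii() adds one quote character on each side; slicing removes exactly
# those two characters and nothing else.
ascii_escaped = ascii(data)[1:-1]

print(json_escaped)   # \u26c4
print(ascii_escaped)  # \u26c4
```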
The difference between the two approaches is that encoding to JSON produces UTF-16 surrogate pairs for characters outside of the BMP, whereas ascii() gives you \Uhhhhhhhh Python escape codes:
>>> data = '🖖'
>>> print('U+{:04X}'.format(ord(data[0])))
U+1F596
>>> print(json.dumps(data))
"\ud83d\udd96"
>>> print(ascii(data))
'\U0001f596'
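If the quotes are unwanted from the start, another option (not one of the approaches above, so treat it as a sketch) is Python's built-in unicode_escape codec, which yields the same escape style as ascii() without surrounding quotes. The function name escape_text is hypothetical:

```python
def escape_text(text):
    """Return text with non-ASCII characters as Python escape sequences.

    Hypothetical helper built on the standard unicode_escape codec.
    Note that code points U+0080 to U+00FF come out as \\xhh rather
    than \\u00hh, and backslashes in the input are doubled.
    """
    return text.encode('unicode_escape').decode('ascii')


print(escape_text('⛄'))   # \u26c4
print(escape_text('🖖'))   # \U0001f596
```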
Upvotes: 2