Reputation: 309
Consider the code below:
text = "\u54c8\u54c8\u54c8\u54c8"
Is there a way to keep only the hexadecimal value of each Unicode escape above and drop the "\u", so that "\u54c8" becomes "54c8" instead?
In JavaScript I can do text.charCodeAt(n).toString(16), but I can't figure out the equivalent in Python.
I tried to match it with a regex:
import re
pattern = re.compile('[\u0000-\uFFFF]')
matches = pattern.finditer(text)
for match in matches:
    print(match)
But all that did was print the character that each Unicode value represents.
Upvotes: 0
Views: 445
Reputation: 189
You can encode the string: either encode to ASCII, ignoring or replacing the non-ASCII characters, or encode to UTF-8:
text = "\u54c8\u54c8\u54c8\u54c8"
utf8string = text.encode("utf-8")
asciistring1 = text.encode("ascii", 'ignore')
asciistring2 = text.encode("ascii", 'replace')
You can refer to https://www.oreilly.com/library/view/python-cookbook/0596001673/ch03s18.html
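As a rough illustration, here is what those three calls return for this string. Note that none of them yields "54c8" directly. The last two lines are not part of the answer above: they are a sketch using the standard "unicode_escape" codec, and stripping "\u" this way only works for code points between U+0100 and U+FFFF, since others are escaped as \xNN or \UNNNNNNNN or left as plain ASCII.

```python
text = "\u54c8\u54c8\u54c8\u54c8"

utf8_bytes = text.encode("utf-8")                 # three bytes per character here
ascii_ignored = text.encode("ascii", "ignore")    # non-ASCII characters are dropped
ascii_replaced = text.encode("ascii", "replace")  # each one becomes b"?"

# Assumption, not from the answer above: escape the string, then strip "\u".
escaped = text.encode("unicode_escape").decode("ascii")
hex_only = escaped.replace("\\u", "")

print(utf8_bytes.hex())
print(ascii_ignored, ascii_replaced)
print(escaped, hex_only)
```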
Upvotes: 0
Reputation: 168834
You can use a regular list comprehension to map over the 4 characters in text, and use ord to get the ordinal (integer) of the codepoint, then hex() to convert it to hexadecimal. The [2:] slice is required to get rid of the 0x prefix Python would otherwise add.
>>> text = "\u54c8\u54c8\u54c8\u54c8"
>>> text
'哈哈哈哈'
>>> [hex(ord(c))[2:] for c in text]
['54c8', '54c8', '54c8', '54c8']
You can then use e.g. "".join() if you need a single string.
(Another way to write the comprehension would be to use an f-string and the x hex format:
>>> [f'{ord(c):x}' for c in text]
['54c8', '54c8', '54c8', '54c8']
)
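If a single string is the goal, a minimal sketch along these lines may help. The helper name to_hex_codepoints and its sep and width parameters are made up for illustration. Padding to four digits is an assumption that mirrors the width of a \uXXXX escape; pass width=0 to get unpadded output like the comprehensions above.

```python
def to_hex_codepoints(text, sep="", width=4):
    """Return the code points of text as lowercase hex, joined by sep."""
    # The nested {width} sets the zero-padded minimum field width.
    return sep.join(f"{ord(c):0{width}x}" for c in text)


text = "\u54c8\u54c8\u54c8\u54c8"
print(to_hex_codepoints(text))
print(to_hex_codepoints("A\u54c8", sep=" "))
```

Keep in mind that ord() returns whole code points, whereas JavaScript's charCodeAt returns UTF-16 code units, so the two differ for characters above U+FFFF.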
If you actually have a string \u54c8\u54c8\u54c8\u54c8, i.e. "backslash, u, five, four, c, eight" repeated 4 times, you'll need to first decode the backslash escape sequences to get the 4-codepoint string:
>>> import codecs
>>> text = r"\u54c8\u54c8\u54c8\u54c8"
>>> codecs.decode(text, "unicode_escape")
'哈哈哈哈'
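Putting the two steps together, a sketch for the literal-backslash case might look like this. The function name literal_escapes_to_hex is hypothetical, and the input is assumed to contain only ASCII characters and escape sequences, since "unicode_escape" can mangle non-ASCII characters that are already decoded.

```python
import codecs


def literal_escapes_to_hex(raw):
    """Decode literal \\uXXXX sequences, then list each code point in hex."""
    decoded = codecs.decode(raw, "unicode_escape")
    return [f"{ord(c):x}" for c in decoded]


raw = r"\u54c8\u54c8\u54c8\u54c8"  # raw string: the backslashes are literal
print(literal_escapes_to_hex(raw))
```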
Upvotes: 1