Aditya Verma

Reputation: 15

Replacing Unicode Characters with actual symbols

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"

I want to get rid of the <U+2019> and replace it with '. Is there a way to do this in Python?

Edit: I also have instances of <U+2014>, <U+201C>, etc. I am looking for something that can replace all of these with the appropriate characters.

Upvotes: 0

Views: 1423

Answers (5)

Mark Tolonen

Reputation: 178115

Replace them all at once with re.sub:

import re

string = "testing<U+2019> <U+2014> <U+201C>testing<U+1F603>"

result = re.sub(r'<U\+([0-9a-fA-F]{4,6})>', lambda x: chr(int(x.group(1),16)), string)
print(result)

Output:

testing’ — “testing😃

The regular expression matches <U+hhhh>, where hhhh is 4 to 6 hexadecimal digits. Unicode defines code points from U+0000 to U+10FFFF, so up to six digits are needed to cover the full range. The lambda replacement function converts the captured hex string to an integer using base 16 and then converts that number to the corresponding Unicode character with chr().
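
A six-digit value above 10FFFF (for example <U+110000>) also matches the pattern, and chr() would then raise ValueError. As a minimal sketch of the same idea with a named function instead of a lambda, assuming you want to leave such out-of-range tokens untouched (the names U_NOTATION, _decode and decode_u_notation are hypothetical, not part of the answer above):

```python
import re

# Matches <U+hhhh> with 4 to 6 hex digits (same pattern as above).
U_NOTATION = re.compile(r'<U\+([0-9a-fA-F]{4,6})>')

def _decode(match):
    code_point = int(match.group(1), 16)
    # chr() only accepts values up to 0x10FFFF; keep anything larger as-is.
    if code_point > 0x10FFFF:
        return match.group(0)
    return chr(code_point)

def decode_u_notation(text):
    # Single pass over the text; every token is handled by _decode.
    return U_NOTATION.sub(_decode, text)

print(decode_u_notation("testing<U+2019> <U+110000>"))
# testing’ <U+110000>
```

Returning the original token is only one possible policy; substituting U+FFFD or raising an error would be equally reasonable choices.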

Upvotes: 2

JosefZ

Reputation: 30238

Here's my solution for all code points denoted as U+0000 through U+10FFFF ("U+" followed by the code point value in hexadecimal, which is prepended with leading zeros to a minimum of four digits):

import re
def UniToChar(unicode_notation):
    # extract the hex digits from <U+hhhh> and convert them to a character
    return chr(int(re.findall(r'<U\+([a-fA-F0-9]{4,5})>',unicode_notation)[0],16))

xx= '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
'''
for x in xx.split('\n'):
    abc =  re.findall(r'<U\+[a-fA-F0-9]{4,5}>',x)
    if len(abc) > 0:
        # one replace per distinct token; str.replace covers all its occurrences
        for uniid in set(abc): x=x.replace(uniid, UniToChar(uniid))

    print(repr(x).strip("'"))

Output of 71307293.py:

At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁

In fact, the private-use range from U+100000 to U+10FFFD (Plane 16) isn't detected by the simplified regex above, because it accepts at most five hex digits. Improved code follows:

import re
def UniToChar(unicode_notation):
    aux = int(re.findall(r'<U\+([a-fA-F0-9]{4,6})>',unicode_notation)[0],16)
    # circumvent the "ValueError: chr() arg not in range(0x110000)"
    if aux <= 0x10FFFF:
        return chr(aux)
    else:
        return chr(0xFFFD) # Replacement Character

xx= '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
Unassigned: <U+05ff>; out of Unicode range: <U+110000>.
'''
for x in xx.split('\n'):
    abc =  re.findall(r'<U\+[a-fA-F0-9]{4,6}>',x)
    if len(abc) > 0:
        # one replace per distinct token; str.replace covers all its occurrences
        for uniid in set(abc): x=x.replace(uniid, UniToChar(uniid))

    print(repr(x).strip("'"))

Output of 71307293.py:

At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁
Unassigned: \u05ff; out of Unicode range: �.
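
The \u05ff in the last line appears because the loop prints repr(x), and repr() escapes characters that Python treats as non-printable, unassigned code points among them. A side effect of repr(x).strip("'") is that a line containing an ASCII apostrophe comes back wrapped in double quotes. A minimal sketch of an alternative that escapes only the non-printable characters, using a hypothetical helper escape_nonprintable that is not part of the answer above (U+FFFF, a permanent noncharacter, stands in for an unassigned code point):

```python
def escape_nonprintable(text):
    """Escape non-printable characters; leave quotes and everything else as-is."""
    return "".join(
        # str.isprintable() is the same test repr() uses to decide what to escape
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )

print(escape_nonprintable("Noncharacter: \uffff; apostrophe: it's"))
# Noncharacter: \uffff; apostrophe: it's
```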

Upvotes: 1

yassine ben

Reputation: 32

Which version of Python are you using?

I edited my answer so that it can be used with multiple code points in the same string.

You need to convert the Unicode code point between < and > to the corresponding Unicode character.

I used a regex to find each code point and then converted it to the corresponding Unicode character.

import re

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"

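# raw string; hex digits rather than \d, so tokens such as <U+201C> also match
pattern = r'<U\+[0-9A-Fa-f]{4,6}>'

repbool = re.search(pattern, string)

while repbool:
  rep = repbool.group()
  # drop the leading "<U+" and the trailing ">", then parse the rest as hex
  string = string.replace(rep, chr(int(rep[3:-1], 16)))
  repbool = re.search(pattern, string)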

print(string)

Upvotes: -1

PW1990

Reputation: 479

You can replace using .replace()

print(string.replace('<U+2019>', "'"))

Or, if the code point varies, you can use re. You can probably write this more neatly than I have.

import re

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"

# raw string; hex digits rather than \d, so codes such as 201C also match
rep = re.search(r'<U\+[0-9A-Fa-f]{4,6}>', string).group()

print(string.replace(rep, "'"))

Upvotes: 0

CrYbAbY

Reputation: 112

I guess this solves the problem if it's just one or two of these characters.

>>> string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
>>> string.replace("<U+2019>","'")
"At Donald Trump's Properties, a Showcase for a Brand and a President-Elect"

If there are many of these substitutions to be done, consider using the map() function.
Source: Removing \u2018 and \u2019 character
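
One way to handle many substitutions, offered as a sketch rather than the linked answer's exact method: keep them in a dictionary and apply them in a single pass with re.sub. The REPLACEMENTS table and the names TOKEN and replace_tokens are illustrative assumptions; tokens that are not in the table are left unchanged.

```python
import re

# Hypothetical table: <U+hhhh> token -> plain ASCII substitute.
REPLACEMENTS = {
    "<U+2018>": "'",
    "<U+2019>": "'",
    "<U+201C>": '"',
    "<U+201D>": '"',
    "<U+2014>": "-",
}

TOKEN = re.compile(r'<U\+[0-9A-Fa-f]{4,6}>')

def replace_tokens(text):
    # Upper-case the token so <U+201c> and <U+201C> hit the same table entry;
    # fall back to the original token when there is no entry.
    return TOKEN.sub(
        lambda m: REPLACEMENTS.get(m.group(0).upper(), m.group(0)), text
    )

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
print(replace_tokens(string))
# At Donald Trump's Properties, a Showcase for a Brand and a President-Elect
```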

Upvotes: 0
