marlon

Reputation: 7633

Why doesn't it remove this unicode character?

import re

JUNK_PATTERN = re.compile(r"[●~〜╮╯▽╰╭★→…&*^❤~\u200b]")

def remove_junk(text):
    return re.sub(JUNK_PATTERN, "", text).strip()

text = 'test <200b><200b>'
print(len(text), text)
text = remove_junk(text)
print(len(text), text)

The output is:

17 test <200b><200b>
17 test <200b><200b>

Why isn't the <200b> removed by the regex?

Upvotes: 0

Views: 300

Answers (1)

Arty

Reputation: 16737

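Your text contains the literal six-character sequence <200b> (angle brackets plus four hex digits), not the actual zero-width space U+200B that \u200b in the pattern matches. You should first un-escape sequences like <200b>, converting each one to the real single Unicode character, and then apply the pattern. Full corrected code below: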

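import re

JUNK_PATTERN = re.compile(r"[●~〜╮╯▽╰╭★→…&*^❤~\u200b]")

def remove_junk(text):
    return re.sub(JUNK_PATTERN, "", text).strip()

def unescape_uni_codes(text):
    # Walk the matches right to left so earlier spans stay valid while text changes.
    for m in reversed(list(re.finditer(r'<[a-fA-F\d]{4}>', text))):
        start, end = m.span()
        # Four hex digits are one big-endian UTF-16 code unit.
        char = bytes.fromhex(m.group(0)[1:-1]).decode('utf-16-be')
        text = text[:start] + char + text[end:]
    return text

text = 'test <200b> <200c>'

print(len(text), text)   # original text with literal placeholders
text = unescape_uni_codes(text)
print(len(text), text)   # placeholders replaced by real characters
text = remove_junk(text)
print(len(text), text)   # U+200B removed; U+200C is not in JUNK_PATTERN and remains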
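
If you prefer a shorter variant, here is a minimal sketch that does the un-escaping with a single re.sub and a callback. It is not part of the code above: PLACEHOLDER and unescape_placeholders are names made up for illustration, and it assumes every placeholder is exactly four hex digits inside angle brackets. It also shows why the original attempt changed nothing: the literal text and the real character are different strings.

```python
import re

JUNK_PATTERN = re.compile(r"[●~〜╮╯▽╰╭★→…&*^❤~\u200b]")
# Hypothetical helper pattern: "<XXXX>" with exactly four hex digits.
PLACEHOLDER = re.compile(r"<([0-9a-fA-F]{4})>")

def unescape_placeholders(text):
    # Replace each "<XXXX>" with the character at that code point.
    return PLACEHOLDER.sub(lambda m: chr(int(m.group(1), 16)), text)

def remove_junk(text):
    return JUNK_PATTERN.sub("", text).strip()

literal = 'test <200b><200b>'   # 17 ordinary ASCII characters
real = 'test \u200b\u200b'      # 7 characters, two of them U+200B

print(len(literal), len(real))
print(repr(remove_junk(literal)))                          # unchanged
print(repr(remove_junk(unescape_placeholders(literal))))   # zero-width spaces gone
```

Using chr(int(..., 16)) avoids the round trip through bytes, at the cost of treating each placeholder as a code point rather than a UTF-16 code unit; for U+200B the result is the same.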

Upvotes: 1
