Reputation: 7633
import re

JUNK_PATTERN = re.compile(r"[●~〜╮╯▽╰╭★→…&*^❤~\u200b]")

def remove_junk(text):
    return re.sub(JUNK_PATTERN, "", text).strip()

text = 'test <200b><200b>'
print(len(text), text)
text = remove_junk(text)
print(len(text), text)
The output is:
17 test <200b><200b>
17 test <200b><200b>
Why isn't the <200b> removed by the regex?
Upvotes: 0
Views: 300
Reputation: 16737
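The reported length of 17 shows that your string contains the literal six characters <200b>, not the actual zero-width space U+200B, so the pattern never matches. You need to unescape encoded sequences like <200b> first, converting each one to the real single Unicode character. The full corrected code is below: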
import re

JUNK_PATTERN = re.compile(r"[●~〜╮╯▽╰╭★→…&*^❤~\u200b]")

def remove_junk(text):
    return re.sub(JUNK_PATTERN, "", text).strip()

def unescape_uni_codes(text):
    # Walk the matches right to left so earlier spans stay valid while splicing.
    for m in reversed(list(re.finditer(r'<[a-fA-F\d]{4}>', text))):
        s = m.span()
        # Decode the four hex digits as one UTF-16 code unit, e.g. '200b' -> U+200B.
        text = text[:s[0]] + bytes.fromhex(m.group(0)[1:-1]).decode('utf-16-be') + text[s[1]:]
    return text

text = 'test <200b> <200c>'
print(len(text), text)
text = unescape_uni_codes(text)
print(len(text), text)
text = remove_junk(text)
print(len(text), text)
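The same unescaping step can also be written more compactly with a substitution callback. This is only a sketch under the assumption that every marker is exactly four hex digits in angle brackets; the name ESCAPED_CODEPOINT is made up for illustration, and chr(int(..., 16)) is swapped in for the UTF-16 decoding used above.

```python
import re

# Same character class as in the question; the re module itself interprets \u200b.
JUNK_PATTERN = re.compile(r"[●~〜╮╯▽╰╭★→…&*^❤~\u200b]")
# Hypothetical helper pattern: four hex digits wrapped in angle brackets.
ESCAPED_CODEPOINT = re.compile(r"<([0-9a-fA-F]{4})>")


def unescape_uni_codes(text):
    """Replace each literal '<XXXX>' with the character at code point XXXX."""
    return ESCAPED_CODEPOINT.sub(lambda m: chr(int(m.group(1), 16)), text)


def remove_junk(text):
    return JUNK_PATTERN.sub("", text).strip()


text = "test <200b><200b>"
print("\u200b" in text)                    # False: only literal markers so far
unescaped = unescape_uni_codes(text)
print(len(unescaped), ascii(unescaped))    # 7 'test \u200b\u200b'
cleaned = remove_junk(unescaped)
print(len(cleaned), ascii(cleaned))        # 4 'test'
```

Printing with ascii() makes invisible characters such as U+200B show up as escapes, which helps tell a real zero-width space apart from the literal <200b> text.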
Upvotes: 1