How to remove all Unicode representations in Python

Question

I am trying to remove all representations of special characters from my document. For example, part of the document contains "world\u2019s". When I split it I get ['world', '\u2019', 's'], but I need only the word, with both the Unicode character and the 's' removed.
I already remove all punctuation, and that works for punctuation that is displayed normally, but not for these Unicode representations. I have also tried a regex that matches everything beginning with a '\', but that doesn't seem to work either.

Synthaze · Accepted Answer

import re

string = "world\u2019s"

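# Encode non-ASCII characters as literal escapes (e.g. \u2019), then drop
# everything from the backslash to the end of each word.
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', str(string.encode('ascii', 'backslashreplace'), 'ascii')))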

Output:

world
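To see why this works, it may help to look at the intermediate string. The following is only a sketch of the steps as I understand them; the variable names `text`, `escaped` and `cleaned` are mine and not part of the original snippet. The key point is that "\u2019" in a Python string literal is a single character, so there is no backslash to match until the string has been encoded with `backslashreplace`.

```python
import re

text = "world\u2019s"

# In the literal, \u2019 is one character (a right single quotation mark),
# so the string is 7 characters long and contains no backslash.
print(len(text), "\\" in text)

# backslashreplace rewrites each non-ASCII character as a literal escape.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")
print(escaped)  # world\u2019s, now with a real backslash (12 characters)

# Keep what precedes the backslash in each word and drop the rest.
cleaned = re.sub(r"\b([^\s]+)\\([^\s]+)\b", r"\1", escaped)
print(cleaned)  # world
```

That is probably also why a regex looking for '\' on the original string finds nothing: the backslash only exists after the encoding step.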

You can apply the same call to the string holding your whole document, and it should work the same way.

import re

string = "world\u2019s h\u2018e"

print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', str(string.encode('ascii', 'backslashreplace'), 'ascii')))

Output:

world h
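If you want to run this over a larger text, one option is to wrap the same steps in a small function. This is a minimal sketch under my own assumptions: `strip_unicode_tails` and `ESCAPE_TAIL` are hypothetical names, and the sample text is made up for illustration.

```python
import re

# Same pattern as above: keep the part of a word before a literal backslash
# and drop the backslash plus the rest of the word.
ESCAPE_TAIL = re.compile(r"\b([^\s]+)\\([^\s]+)\b")


def strip_unicode_tails(document):
    """Hypothetical helper: apply the approach above to a whole text."""
    # Turn every non-ASCII character into a literal escape such as \u2019.
    escaped = document.encode("ascii", "backslashreplace").decode("ascii")
    # The pattern never matches whitespace, so spaces and newlines are kept.
    return ESCAPE_TAIL.sub(r"\1", escaped)


document = "world\u2019s best\nh\u2018e said"
print(strip_unicode_tails(document))
```

One thing to keep in mind: the pattern cannot tell an escape produced by the encoding step from a backslash that was already in the text, so a word like "C:\temp" would be cut down to "C:" as well.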
