Reputation:
I am trying to remove all representations of special characters in my document, for example part of the document says: "world\u2019s", when I split this it gives ['world', '\u2019', 's']
but I need only the word(unicode and 's' removed).
I am already removing all punctuation and this works on the actual punctuation that are shown normally not on these unicode representations.
And I have also tried to use regex to match everything that begins with a '\' but that doesn't seem to work either.
Upvotes: 1
Views: 123
Reputation: 6090
import re
string = "world\u2019s"
print (re.sub(r"\b([^\s]+)\\([^\s]+)\b",r'\1',str(string.encode('ascii', 'backslashreplace'), 'ascii')))
Output:
world
You can apply this to your whole string document, should be working.
import re
string = "world\u2019s h\u2018e"
print (re.sub(r"\b([^\s]+)\\([^\s]+)\b",r'\1',str(string.encode('ascii', 'backslashreplace'), 'ascii')))
Output:
world h
Upvotes: 3