user14500667

How to remove all Unicode representations in Python

I am trying to remove all representations of special characters from my document. For example, part of the document contains "world\u2019s". When I split this, I get ['world', '\u2019', 's'], but I need only the word, with both the Unicode character and the trailing 's' removed.
I am already removing all punctuation, and that works on punctuation that appears as ordinary characters, but not on these Unicode representations. I have also tried using a regex to match everything that begins with a '\', but that doesn't seem to work either.

Upvotes: 1

Views: 123

Answers (1)

Synthaze

Reputation: 6090

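import re

string = "world\u2019s"

# Write non-ASCII characters as literal \uXXXX escapes, then cut each word at the backslash.
escaped = string.encode('ascii', 'backslashreplace').decode('ascii')
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', escaped))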

Output:

world
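
The encode step is what gives the regex a backslash to match. Assuming your document holds the actual character U+2019 rather than a literal backslash sequence, the string contains no backslash at all, which would explain why matching on '\' found nothing. A small check along these lines (the variable names are only for illustration):

```python
s = "world\u2019s"  # the literal is parsed into a single character, U+2019

print(len(s))      # 7: five letters, one apostrophe character, one letter
print("\\" in s)   # False: the string contains no backslash

# backslashreplace writes the character out as a literal six-character escape
escaped = s.encode("ascii", "backslashreplace").decode("ascii")
print(len(escaped))     # 12
print("\\" in escaped)  # True
```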

You can apply the same substitution to the whole document string; it should work there as well.

import re

string = "world\u2019s h\u2018e"

# Same substitution applied to a string with several words.
escaped = string.encode('ascii', 'backslashreplace').decode('ascii')
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', escaped))

Output:

world h
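
If the encode round trip feels indirect, a possible alternative is to match the non-ASCII character itself. This is only a sketch, not the approach used above: it assumes every character outside the ASCII range counts as "special", and the helper name strip_non_ascii_tails is made up for the example. It also drops a word entirely when the word starts with such a character.

```python
import re

def strip_non_ascii_tails(text):
    """Cut each word at its first non-ASCII character (hypothetical helper)."""
    # [^\x00-\x7F] matches one non-ASCII character;
    # \S* then consumes the rest of that word up to the next whitespace.
    return re.sub(r"[^\x00-\x7F]\S*", "", text)

print(strip_non_ascii_tails("world\u2019s h\u2018e"))  # world h
```

Because it works on the decoded string directly, it does not depend on how the characters would be escaped.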

Upvotes: 3
