Reputation: 111
I have a string in a document that contains some Unicode characters. I just want to remove them or replace them with a space (" "). Example:
doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
How do I convert it to the following?
doc = "Hello my name is Ruth! I really like swimming and dancing"
I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.
Upvotes: 3
Views: 4754
Reputation: 78680
You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to an ASCII character).
>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '
If the trailing whitespace bothers you, strip it off. The result is a bytes object; depending on your use case, you can decode it back to a str with ASCII. Chaining everything looks like this:
>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
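Note that this still leaves a space before the exclamation mark ("Ruth !"), whereas the output asked for in the question has "Ruth!". A minimal sketch of one way to tidy that up follows. The helper name `strip_non_ascii` and the set of punctuation characters are assumptions made for illustration, not part of any library.

```python
import re

def strip_non_ascii(text: str) -> str:
    """Hypothetical helper: drop non-ASCII code points and tidy the leftover spacing."""
    # Drop every code point that cannot be encoded as ASCII, then decode back to str.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse whitespace runs left behind by the removed characters and trim the ends.
    collapsed = re.sub(r"\s+", " ", ascii_only).strip()
    # Remove whitespace sitting directly before punctuation ("Ruth !" -> "Ruth!").
    # The punctuation set here is an assumption; adjust it to your text.
    return re.sub(r"\s+([!?.,;:])", r"\1", collapsed)

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
print(strip_non_ascii(doc))
```

If you would rather replace the removed characters with a space than drop them, something like `re.sub(r"[^\x00-\x7F]+", " ", doc)` in place of the encode/decode step should work with the same clean-up afterwards.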
Upvotes: 9