Fregy

Reputation: 111

Remove Unicode escapes (\uxxxx) from a string in Python

I have a string in a document that contains some Unicode characters. All I want is to remove these characters or replace them with a space (" "). Example:

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

How do I convert it to the following?

doc = "Hello my name is Ruth! I really like swimming and dancing"

I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.

Upvotes: 3

Views: 4754

Answers (1)

timgeb

Reputation: 78680

You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to an ASCII character).

>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '

If the trailing whitespace bothers you, strip it off. If you need a str rather than a bytes object, decode the result as ASCII again. Chaining everything would look like this:

>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
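Note that the result above still has a space before the "!", whereas the question asks for "Ruth!". One possible way to get that exact output is a regular expression that removes each run of non-ASCII code points together with the whitespace directly in front of it. This is only a sketch: the helper name strip_non_ascii and its replacement parameter are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical helper, not from the original answer.
# Matches optional whitespace followed by one or more non-ASCII code points.
NON_ASCII = re.compile(r"\s*[^\x00-\x7F]+")


def strip_non_ascii(text, replacement=""):
    """Replace non-ASCII runs (and the whitespace before them), then trim the ends."""
    return NON_ASCII.sub(replacement, text).strip()


doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
print(strip_non_ascii(doc))
# Hello my name is Ruth! I really like swimming and dancing
```

Passing replacement=" " substitutes a single space instead of deleting, which covers the "replace it with a space" variant from the question. Unlike the encode/decode round trip, this works on the str directly and never produces a bytes object.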

Upvotes: 9
