geotheory

Reputation: 23690

Removing xml unicode characters from strings

I'm struggling to remove XML Unicode character references from strings. Adapting this solution for Python 3 fails:

s = 'foo&#1057;&#1098;&#1073;bar'
s.encode('ascii', errors='ignore')
# b'foo&#1057;&#1098;&#1073;bar'

I've also tried unescaping with xml.sax.saxutils but with no luck:

from xml.sax.saxutils import unescape
unescape(s).encode('ascii', errors='ignore')
# b'foo&#1057;&#1098;&#1073;bar'

Any suggestions appreciated.

Upvotes: 0

Views: 274

Answers (1)

Daweo

Reputation: 36838

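You can use html.unescape for this task. It converts named and numeric character references to the corresponding Unicode characters, which encoding to ASCII with errors='ignore' then drops. xml.sax.saxutils.unescape only handles &amp;, &lt; and &gt; by default, which is why it left the string unchanged.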

import html
s = 'foo&#1057;&#1098;&#1073;bar'
s2 = html.unescape(s).encode('ascii', errors='ignore')
print(s2)

output:

b'foobar'
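
If you need a str rather than bytes, the same idea can be wrapped in a small helper. This is only a sketch: the function name strip_char_refs is made up for illustration, and it assumes the input contains character references such as &#1057; and that dropping every non-ASCII character is acceptable.

```python
import html


def strip_char_refs(text: str) -> str:
    """Decode character references, then drop every non-ASCII character.

    Hypothetical helper, not part of any library.
    """
    # html.unescape turns '&#1057;' into the character it refers to
    decoded = html.unescape(text)
    # encode() discards what ASCII cannot represent; decode() gives back a str
    return decoded.encode('ascii', errors='ignore').decode('ascii')


print(strip_char_refs('foo&#1057;&#1098;&#1073;bar'))  # foobar
print(strip_char_refs('a &amp; b'))  # a & b
```

Note that ASCII references such as &amp; are decoded and kept, so only the non-ASCII characters disappear.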

Upvotes: 1
