Reputation: 183
I made a GET request with the Python 3 requests library, parsed the JSON response, and extracted this name:
'Harrison Elementary School \U0001f3eb'
I looked it up, and the Unicode character U+1F3EB is the school emoji. But when I print it, I get:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3eb' in position 27: character maps to <undefined>
I really don't care about having that unicode character. It's not important for my purposes.
How can I strip that unicode character and any other invalid characters from this or any string I come across?
Upvotes: 0
Views: 3378
Reputation: 308432
First, you have to determine why the characters are invalid. It appears that the error message was generated when you tried to print the string, meaning that the Unicode character could not be encoded using the default output encoding. For print, this should be sys.stdout.encoding.
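To illustrate how the output encoding causes the failure, here is a minimal sketch that simulates a console with a limited encoding. The cp1252 code page and the `console` stand-in are assumptions for illustration; check sys.stdout.encoding on your own machine to see what applies there.

```python
import io

# Stand-in for a console whose encoding cannot represent the emoji.
# cp1252 is an assumption; your real value is in sys.stdout.encoding.
buffer = io.BytesIO()
console = io.TextIOWrapper(
    buffer,
    encoding="cp1252",
    errors="replace",  # substitute "?" instead of raising UnicodeEncodeError
    newline="\n",      # keep line endings the same on every platform
)

print("Harrison Elementary School \U0001f3eb", file=console)
console.flush()

written = buffer.getvalue()
print(written)
```

With errors left at its default of "strict", the same print call raises UnicodeEncodeError. On Python 3.7 or later, sys.stdout.reconfigure(errors="replace") should apply the same handler to the real stdout, provided stdout is a regular text stream.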
You can encode the string yourself and ignore characters that are invalid, but that leaves you with a byte string. It is necessary to decode those bytes back into a Unicode string.
def sanitize(s, encoding, errors='ignore'):
    return s.encode(encoding, errors=errors).decode(encoding)
>>> import sys
>>> print(sanitize('Harrison Elementary School \U0001f3eb', sys.stdout.encoding))
Harrison Elementary School
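If silently dropping characters is too lossy, the same round trip works with other standard error handlers. This is a sketch that fixes the encoding to cp1252 so the results are predictable; that encoding is an assumption, and in practice you would pass sys.stdout.encoding as above.

```python
def sanitize(s, encoding, errors="ignore"):
    """Round-trip s through encoding, handling unencodable characters per errors."""
    return s.encode(encoding, errors=errors).decode(encoding)

name = "Harrison Elementary School \U0001f3eb"

# "ignore" drops the character; note the trailing space is left behind.
dropped = sanitize(name, "cp1252")
# "replace" substitutes a question mark.
replaced = sanitize(name, "cp1252", errors="replace")
# "backslashreplace" keeps a readable escape sequence.
escaped = sanitize(name, "cp1252", errors="backslashreplace")

print(repr(dropped))
print(repr(replaced))
print(repr(escaped))
```

Calling .rstrip() on the result removes the space left where the emoji used to be.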
Upvotes: 1
Reputation: 1253
The character is not really invalid, just undefined, so when you are encoding you can often tell the encoder how to handle errors:
import codecs
school_name = "Harrison Elementary School \U0001f3eb"
encoded_name = codecs.charmap_encode(school_name, 'ignore')
print(encoded_name)
The result is a tuple of the encoded bytes and the number of characters consumed: (b'Harrison Elementary School ', 28)
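The same idea can be sketched with the documented str.encode method, which returns the bytes directly instead of a tuple. The latin-1 target encoding here is an assumption chosen for illustration; use whichever encoding your output actually needs.

```python
school_name = "Harrison Elementary School \U0001f3eb"

# Encode with the "ignore" handler so unencodable characters are dropped.
encoded_bytes = school_name.encode("latin-1", errors="ignore")

# Decode back to str and trim the space left where the emoji was.
clean_name = encoded_bytes.decode("latin-1").rstrip()

print(encoded_bytes)
print(clean_name)
```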
Upvotes: 2