alexs973

Reputation: 183

Get rid of invalid unicode character in string variable

I made a GET request with the Python 3 requests library, parsed the response as JSON, and extracted the name:

'Harrison Elementary School \U0001f3eb'

I looked it up, and the Unicode character represents a school (U+1F3EB, SCHOOL). But when I print it, I get:

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3eb' in position 27: character maps to <undefined>

I really don't care about having that unicode character. It's not important for my purposes.

How can I strip that unicode character and any other invalid characters from this or any string I come across?

Upvotes: 0

Views: 3378

Answers (2)

Mark Ransom

Reputation: 308432

First, you have to determine why the characters are invalid. The error was raised when you tried to print the string, which means the Unicode character could not be encoded using the default output encoding. For print, that encoding is sys.stdout.encoding.
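
As a rough way to check this, the sketch below lists which characters a given encoding cannot represent. The helper name unencodable_chars is hypothetical (it is not from the answer), and cp1252 is only assumed as an example of a Windows console encoding; the 'charmap' in the traceback suggests some code page of that kind, but the question does not say which.

```python
import sys

def unencodable_chars(s, encoding):
    """Return the characters in s that `encoding` cannot represent."""
    bad = []
    for ch in s:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

name = "Harrison Elementary School \U0001f3eb"

# The encoding attribute can be missing or None when stdout is not a normal
# text stream, so fall back to UTF-8 (an assumption made for this sketch).
console_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
print("stdout encoding:", console_encoding)

# ascii() escapes non-ASCII characters, so these prints cannot fail themselves.
print("not encodable for stdout:", ascii(unencodable_chars(name, console_encoding)))
print("not encodable in cp1252:", ascii(unencodable_chars(name, "cp1252")))
```

If the list for your console encoding is empty, the problem lies somewhere other than the output encoding.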

You can encode the string yourself and ignore the characters that cannot be encoded, but that leaves you with a byte string, so you then need to decode those bytes back into a Unicode string.

def sanitize(s, encoding, errors='ignore'):
    return s.encode(encoding, errors=errors).decode(encoding)

>>> import sys
>>> print(sanitize('Harrison Elementary School \U0001f3eb', sys.stdout.encoding))
Harrison Elementary School 
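
If dropping the character silently is too lossy, the same round trip works with the other standard error handlers. This is a minimal sketch reusing the sanitize function above; cp1252 is again only an assumed example encoding, so substitute sys.stdout.encoding in real use.

```python
def sanitize(s, encoding, errors="ignore"):
    # Round-trip through the target encoding; the error handler decides
    # what happens to characters the encoding cannot represent.
    return s.encode(encoding, errors=errors).decode(encoding)

name = "Harrison Elementary School \U0001f3eb"

# cp1252 is assumed here only as an example of a Windows console encoding.
for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    print(handler, repr(sanitize(name, "cp1252", errors=handler)))

# ignore            -> 'Harrison Elementary School '
# replace           -> 'Harrison Elementary School ?'
# backslashreplace  -> 'Harrison Elementary School \\U0001f3eb'
# xmlcharrefreplace -> 'Harrison Elementary School &#127979;'
```

With 'backslashreplace' and 'xmlcharrefreplace' the original code point can still be recovered from the output, which may help when debugging.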

Upvotes: 1

FredrikHedman

Reputation: 1253

The character is not really invalid; it is just undefined in the target encoding. When you are encoding, you can often tell the encoder how to handle such errors:

import codecs 

school_name = "Harrison Elementary School \U0001f3eb"
encoded_name = codecs.charmap_encode(school_name, 'ignore')
print(encoded_name) 

The result is (b'Harrison Elementary School ', 28): the encoded bytes and the number of characters consumed.
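
The error handler can also be set on the output stream itself, so that print never raises. The sketch below simulates a cp1252-only console with an in-memory stream; the names buffer, console and written are illustrative only. On a real console in Python 3.7+, sys.stdout.reconfigure(errors="replace") should have the same effect, assuming sys.stdout is an ordinary text stream.

```python
import io

# Simulate a console that only understands cp1252 (an assumed example encoding).
# errors="replace" turns unencodable characters into "?" instead of raising.
buffer = io.BytesIO()
console = io.TextIOWrapper(buffer, encoding="cp1252", errors="replace", newline="\n")

print("Harrison Elementary School \U0001f3eb", file=console)
console.flush()

written = buffer.getvalue()
print(written)  # b'Harrison Elementary School ?\n'
```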

Upvotes: 2
