Reputation: 619
How to replace characters that cannot be decoded using utf8 with whitespace?
# -*- coding: utf-8 -*-
print unicode('\x97', errors='ignore') # print out nothing
print unicode('ABC\x97abc', errors='ignore') # print out ABCabc
How can I print out ABC abc
instead of ABCabc
? Note, \x97
is just an example character. The characters that cannot be decoded are unknown inputs.
errors='ignore'
, it will print out nothing. errors='replace'
, it will replace that character with some special chars. Upvotes: 6
Views: 3960
Reputation: 1863
Take a look at codecs.register_error
. You can use it to register custom error handlers
https://docs.python.org/2/library/codecs.html#codecs.register_error
import codecs
codecs.register_error('replace_with_space', lambda e: (u' ',e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')
Upvotes: 9
Reputation: 107337
You can use a try-except
statement to handle the UnicodeDecodeError
:
def my_encoder(my_string):
for i in my_string:
try :
yield unicode(i)
except UnicodeDecodeError:
yield '\t' #or another whietespaces
And then use str.join
method to join your string :
print ''.join(my_encoder(my_string))
Demo :
>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a n exam ple
Upvotes: 3