Reputation: 6696
In a text file I'm processing, I have characters like ����. Not sure what they are.
I'm wondering how to remove/convert these characters.
I have tried to convert it to ASCII using .encode('ascii', 'ignore'). Python told me a char is not within 0,128.
I have also tried unicodedata: unicodedata.normalize('NFKD', text).encode('ascii', 'ignore'), with the same error.
Can anyone help?
Thanks!
Upvotes: 4
Views: 10560
Reputation: 375854
You can always take a Unicode string and use the code you showed:
my_ascii = my_uni_string.encode('ascii', 'ignore')
If that gave you an error, then you didn't really have a Unicode string to begin with. If that is true, then you have a byte string instead. You'll need to know what encoding it's using, and you can turn it into a Unicode string with:
my_uni_string = my_byte_string.decode('utf8')
(assuming your encoding is UTF-8).
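To make the two steps concrete, here is a minimal sketch that decodes first and only then strips non-ASCII characters. It is written for Python 3, where byte strings are `bytes` and Unicode strings are `str`; the helper name `to_ascii`, the sample data, and the choice of `errors="replace"` are assumptions for illustration, not part of the original code.

```python
# Hypothetical sample: UTF-8 bytes as they might come from a file opened in binary mode.
raw_bytes = "caf\u00e9 na\u00efve".encode("utf-8")

def to_ascii(data, encoding="utf-8"):
    """Decode bytes to a Unicode string, then drop every non-ASCII character."""
    # Step 1: bytes -> Unicode. Undecodable bytes become U+FFFD instead of raising.
    text = data.decode(encoding, errors="replace")
    # Step 2: Unicode -> ASCII bytes, silently skipping anything outside ASCII.
    ascii_bytes = text.encode("ascii", "ignore")
    # Return a str again so the result is easy to print or write out.
    return ascii_bytes.decode("ascii")

result = to_ascii(raw_bytes)
```

If the wrong encoding is passed in, the decode step still succeeds with `errors="replace"`, but the affected characters are lost, so knowing the real encoding still matters.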
The split between byte strings and Unicode strings can be confusing. My presentation, "Pragmatic Unicode, or, How Do I Stop the Pain?", can help you keep it all straight.
Upvotes: 8
Reputation: 50205
It's not perfect (especially for shorter strings) but the chardet library would be of use here:
http://pypi.python.org/pypi/chardet
To have chardet figure out the encoding and then decode the byte string to unicode, you would do:
import chardet
encoding = chardet.detect(some_string)['encoding']
unicode_string = unicode(some_string, encoding)
Of course, you won't be able to encode the characters as ASCII if they're outside the ASCII range.
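Since detection can fail outright (chardet may report no encoding at all), a slightly more defensive variant might look like the sketch below. It assumes Python 3, where `unicode()` no longer exists and `bytes.decode` does the same job; the helper name `guess_and_decode` and the UTF-8 fallback are assumptions, and the import guard is there only so the sketch still runs when chardet is not installed.

```python
try:
    import chardet  # third-party package, assumed to be installed via pip
except ImportError:
    chardet = None  # without chardet, the default encoding below is used

def guess_and_decode(data, default="utf-8"):
    """Decode bytes with chardet's guessed encoding, or `default` if there is no guess."""
    # chardet.detect returns a dict; its 'encoding' entry can be None.
    guess = chardet.detect(data)["encoding"] if chardet else None
    # errors="replace" keeps a wrong guess from raising UnicodeDecodeError.
    return data.decode(guess or default, errors="replace")

text = guess_and_decode(b"hello world")
```

The guess is only a statistical estimate, so for short inputs it is worth checking the result by eye before trusting it.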
Upvotes: 1