bolei

Reputation: 166

Handling non-ASCII strings in Python

Handling non-ASCII characters in Python is really confusing. Can anyone explain?

I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.

I have a list of characters:

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

For each token I get, I replace every character from that list with a space by calling:

    for punc in ignorelist:
        token = token.replace(punc, ' ')

Notice that there is a non-ASCII character at the end of ignorelist: u'—'

Every time my code encounters that character, it crashes with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried declaring the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still does not work. Does anyone know why? Thanks!

Upvotes: 1

Views: 3516

Answers (2)

klobucar

Reputation: 6335

Your file input is not being decoded as UTF-8. When Python 2 compares that input with the unicode character, it tries to decode the input as ASCII, and that fails on the first non-ASCII byte.

Try reading the file like this instead:

import codecs
f = codecs.open("test", "r", "utf-8")
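
A slightly fuller sketch of the same idea, using io.open, which accepts an encoding argument on both Python 2.6+ and Python 3. The file name, its contents, and the shortened ignorelist are made up for the demonstration, and the file is assumed to be UTF-8.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Shortened, all-unicode version of the punctuation list (illustrative).
ignorelist = (u'!', u',', u'.', u'\u2014')  # u'\u2014' is the em dash

# Create a small UTF-8 sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'wait\u2014what, really!')

# Reading with an explicit encoding yields unicode text, not raw bytes.
with io.open(path, 'r', encoding='utf-8') as f:
    token = f.read()

# Both sides of replace() are unicode, so no implicit ASCII decode happens.
for punc in ignorelist:
    token = token.replace(punc, u' ')
```

Because the decoding happens once, at the point where the file is read, the replacement loop itself does not need to change.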

Upvotes: 4

lilydjwg

Reputation: 1713

You are using Python 2.x, which tries to convert between unicode objects and plain strs automatically. That implicit conversion uses the ascii codec by default, so it fails as soon as a non-ASCII character is involved.

You shouldn't mix unicode objects and strs. You can either stick to unicode:
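
The implicit conversion can be reproduced by hand. The sketch below assumes the token came from a UTF-8 file, so the em dash arrives as three bytes; decoding them with the ascii codec, which is what Python 2 attempts behind the scenes when a str meets a unicode object, fails on the first byte. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The em dash (U+2014) as it appears in a UTF-8 encoded file: three bytes.
raw_dash = u'\u2014'.encode('utf-8')

try:
    # Python 2 performs this decode implicitly when a byte str is combined
    # with a unicode object; here it is done explicitly.
    raw_dash.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the data was actually written in succeeds.
decoded_dash = raw_dash.decode('utf-8')
```

The captured message names byte 0xe2, the same byte reported in the question's traceback.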

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')

if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')

or use only plain strs (note the last one):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change

By manually encoding u'—' into a str, you spare Python from attempting that conversion itself.

I suggest using unicode throughout your program to avoid this kind of error. If that would be too much work, you can use the latter method, but take care when calling functions in the standard library or third-party modules, since they may expect or return unicode.
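
One way to follow the "unicode everywhere" advice is to decode once at the input boundary, work on text only, and encode once at the output boundary. The helper names to_text and strip_punctuation below are hypothetical, the ignorelist is shortened, and UTF-8 is assumed as the data encoding.

```python
# -*- coding: utf-8 -*-
ignorelist = (u'!', u'-', u'.', u'\u2014')  # shortened; u'\u2014' is the em dash


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def strip_punctuation(token):
    """Replace every character listed in ignorelist with a space."""
    token = to_text(token)
    for punc in ignorelist:
        token = token.replace(punc, u' ')
    return token


# Bytes as they would come from a file opened in binary mode.
raw = u'yes\u2014no!'.encode('utf-8')
cleaned = strip_punctuation(raw)    # decoded on the way in
encoded = cleaned.encode('utf-8')   # encoded only when writing out
```

Keeping the decode in one helper means the rest of the code never has to check which string type it is holding.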

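# -*- coding: utf-8 -*- only tells Python that your source file is written in UTF-8 (without it, non-ASCII characters in the source cause a SyntaxError). It has no effect on how data read at runtime is decoded.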

Upvotes: 2
