Reputation: 4995
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2 aka #8217). But when I run that line of code, I get this error:
SyntaxError: Non-ASCII character '\xe2' in file
EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()
# replace bad characters
raw = raw.replace('’', "")
print(raw)
Even after the above code is executed, the unwanted character still appears in the printed result. I tried the suggestions in the answers below as well. I'm pretty sure it's an encoding issue, but I don't know how to fix it, so any help is much appreciated.
Upvotes: 10
Views: 20536
Reputation: 5621
I was getting such Non-ASCII character '\xe2' errors repeatedly with my Python scripts, despite replacing the single-quotes. It turns out the non-ASCII character really was a double en dash (−−). I replaced it with a regular double dash (--) and that fixed it. [Both will look the same on most screens. Depending on your font settings, the problematic one might look a bit longer.]
For anyone encountering the same issue in their Python scripts (in the lines of code themselves, not in data loaded by the script):
Option 1: get rid of the problematic character
Option 2: change the encoding
Declare an encoding at the beginning of the script, as Roberto pointed out:
# encoding: utf-8
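For Option 1, the hard part is usually finding the character, since it can look identical to its ASCII cousin. Here is a minimal sketch that reports the position of every non-ASCII character in a piece of text. The helper name find_non_ascii and the sample string are made up for illustration, and the sketch assumes the text has already been decoded to Unicode.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical script text containing two en dashes (U+2013) instead of "--".
sample = u"x = 1\ny = x \u2013\u2013 1\n"

for line_no, col, ch in find_non_ascii(sample):
    print("line %d, column %d: U+%04X" % (line_no, col, ord(ch)))
```

To scan a real script, one option is to read it with io.open(path, encoding='utf-8').read() and pass the result to the function; that assumes the file is valid UTF-8.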
Hope this helps someone.
Upvotes: 0
Reputation: 140540
The problem here is with the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII[1] octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're trying to search for and replace the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.
Fixing this is as simple as adding appropriate calls to encode and decode:
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')
# replace bad characters
raw = raw.replace(u'’', u"'")
print(raw.encode("ascii"))
[1] By "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
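To see the mismatch without hitting the network, here is a small sketch using a made-up byte string in place of the downloaded file (the sample text is an assumption, not the real CSV contents). It shows that U+2019 is one byte in windows-1252 but three bytes in UTF-8, so a byte-level replace of the UTF-8 form never matches.

```python
# Made-up bytes standing in for the downloaded file; 0x92 is the
# windows-1252 encoding of U+2019 (right single quotation mark).
raw = b"St. Mary\x92s Church"

utf8_quote = u"\u2019".encode("utf-8")            # three bytes: e2 80 99
cp1252_quote = u"\u2019".encode("windows-1252")   # one byte: 92

# Replacing the UTF-8 byte sequence in the raw bytes is a no-op,
# because that sequence does not occur in windows-1252 data.
unchanged = raw.replace(utf8_quote, b"'")

# Decoding first turns the byte 0x92 into the character U+2019,
# which can then be replaced at the character level.
text = raw.decode("windows-1252")
fixed = text.replace(u"\u2019", u"'")
print(fixed)
```

The byte-string and u'' literals are written with escapes only, so the snippet itself contains no non-ASCII characters and does not depend on the source file's encoding declaration.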
Upvotes: 13
Reputation: 177550
This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:
data = open('aa_meetings.csv').read()
assert '\x92' in data
chars = data.decode('cp1252')
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', '')
assert u'\u2019' not in fixed
The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
Using unicode literals as I have here is an easy way to avoid this mistake. However, you can encode the character directly if you write it as u'’':
Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'
Upvotes: 3
Reputation: 69031
You can do string.replace('\xe2', "'") to replace them with the normal single-quote.
Upvotes: 2
Reputation: 33329
You have to declare the encoding of your source file. Put this as one of the first two lines of your code:
# encoding: utf-8
If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
Upvotes: 8