Gady

Reputation: 4995

Replacing a weird single-quote (’) with blank string in Python

I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2 aka #8217). But when I run that line of code, I get this error:

SyntaxError: Non-ASCII character '\xe2' in file

EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.

# encoding: utf-8

import urllib2

# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()

# replace bad characters
raw = raw.replace('’', "")

print(raw)

Even after the above code is executed, the unwanted character still appears in the printed result. I tried the suggestions in the answers below as well. I'm pretty sure it's an encoding issue, but I don't know how to fix it, so any help is much appreciated.

Upvotes: 10

Views: 20536

Answers (5)

Fabien Snauwaert

Reputation: 5621

I was getting such Non-ASCII character '\xe2' errors repeatedly with my Python scripts, despite replacing the single-quotes. It turns out the non-ASCII character really was a double en dash (−−). I replaced it with a regular double dash (--) and that fixed it. [Both will look the same on most screens. Depending on your font settings, the problematic one might look a bit longer.]

For anyone encountering the same issue in a Python script (in the lines of code themselves, not in data loaded by the script):

Option 1: get rid of the problematic character

  • Re-type the line by hand. (To make sure you did not copy-paste the problematic character by mistake.)
  • Note that commenting the line out will not work: Python checks the encoding of the whole source file, comments included.
  • Check whether the problematic character really is the one you think.
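To check which character is really causing the problem, one option is to scan the text of the script and print the Unicode name of every non-ASCII character. The following is only a minimal sketch, written for Python 3; the helper name find_non_ascii and the sample text are hypothetical, made up for illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Yield (line, column, char, unicode_name) for each non-ASCII character.

    Hypothetical helper: line and column numbers start at 1.
    """
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col, ch, unicodedata.name(ch, "UNKNOWN")


# Made-up sample: a double en dash on line 2, curly quotes on line 3.
sample = "x = 1\ny = x \u2013\u2013 2\ns = \u2019hi\u2019\n"

hits = list(find_non_ascii(sample))
for line_no, col, ch, name in hits:
    print("line %d, column %d: %r is %s" % (line_no, col, ch, name))
```

To run it on a real script, read the file as text with the encoding you believe it uses and pass the result to find_non_ascii.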

Option 2: change the encoding

Declare an encoding at the beginning of the script, as Roberto pointed out:

# encoding: utf-8

Hope this helps someone.

Upvotes: 0

zwol

Reputation: 140540

The problem here is with the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII[1] octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're trying to search and replace for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.

Fixing this is as simple as adding appropriate calls to encode and decode:

# encoding: utf-8
import urllib2

# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')

# replace bad characters
raw = raw.replace(u'’', u"'")

print(raw.encode("ascii"))

[1] By "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
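For readers on Python 3, where bytes and text are separate types, the same decode-then-replace step might look like the sketch below. The function name replace_curly_apostrophes and the sample bytes are hypothetical; the download step is left out so the example runs without network access.

```python
def replace_curly_apostrophes(raw_bytes, replacement="'", encoding="windows-1252"):
    """Decode raw bytes, then replace U+2019 in the resulting text.

    Hypothetical helper: `encoding` is an assumption about the data and
    must match how the bytes were actually encoded.
    """
    text = raw_bytes.decode(encoding)
    return text.replace("\u2019", replacement)


# Made-up windows-1252 bytes; 0x92 is the curly apostrophe in that encoding.
sample = b"Men\x92s Group"

with_plain_quote = replace_curly_apostrophes(sample)
with_nothing = replace_curly_apostrophes(sample, replacement="")

print(with_plain_quote)
print(with_nothing)
```

Passing replacement="" gives the blank-string behaviour the question asked for, while the default swaps in a plain ASCII apostrophe as in the code above.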

Upvotes: 13

Josh Lee

Reputation: 177550

This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:

data = open('aa_meetings.csv').read()
assert '\x92' in data
chars = data.decode('cp1252')
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', '')
assert u'\u2019' not in fixed

The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.

Using Unicode escapes (u'\u2019'), as I have here, is an easy way to avoid this mistake. You can also type the character directly, provided you write it as a unicode literal, u'’', rather than as a byte string:

Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'
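The mismatch described in this answer can also be checked directly. This is a small Python 3 sketch with made-up sample bytes (not the real CSV): the same character, U+2019, becomes different byte sequences in UTF-8 and in cp1252, so a search for one cannot find the other.

```python
# Made-up cp1252-encoded data standing in for the downloaded file.
data = b"Men\x92s Group"

curly = "\u2019"  # RIGHT SINGLE QUOTATION MARK

utf8_bytes = curly.encode("utf-8")      # three bytes
cp1252_bytes = curly.encode("cp1252")   # one byte

found_as_utf8 = utf8_bytes in data      # searching for the wrong bytes
found_as_cp1252 = cp1252_bytes in data  # searching for the right bytes

print(utf8_bytes, cp1252_bytes, found_as_utf8, found_as_cp1252)

# Once decoded, the text contains the character itself, whatever the
# original byte encoding was.
found_in_text = curly in data.decode("cp1252")
```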

Upvotes: 3

Ethan Furman

Reputation: 69031

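You can do string.replace('\xe2\x80\x99', "'") to replace them with the normal single-quote. In UTF-8 data the character is the three-byte sequence \xe2\x80\x99, so replacing '\xe2' alone would leave the other two bytes behind.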

Upvotes: 2

Roberto Bonvallet

Reputation: 33329

You have to declare the encoding of your source file. Put this as one of the first two lines of your code:

# encoding: utf-8

If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
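To see which encoding a declaration actually selects, the standard library's tokenize.detect_encoding can be pointed at the source bytes. The sketch below assumes Python 3, where the default without a declaration is UTF-8 (Python 2, as used in the question, defaults to ASCII); the helper name declared_encoding and the sample sources are hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source.

    Hypothetical helper around tokenize.detect_encoding, which looks for
    a BOM or a coding declaration in the first two lines only.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Declaration on line 2, after a shebang: it is honoured.
with_declaration = b"#!/usr/bin/env python\n# encoding: latin-1\nname = 'x'\n"

# Declaration on line 3: too late, so the default applies.
too_late = b"import os\nx = 1\n# encoding: latin-1\n"

print(declared_encoding(with_declaration))
print(declared_encoding(too_late))
```

Note that detect_encoding reports Latin-1 under its normalized name, iso-8859-1.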

Upvotes: 8
