Reputation: 93803
I'm reading in a file with Python's csv module, and have Yet Another Encoding Question (sorry, there are so many on here).
In the CSV file, there are £ signs. After reading the row in and printing it, they have become \xa3.
Trying to encode them as Unicode produces a UnicodeDecodeError:
row = [unicode(x.strip()) for x in row]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)
I have been reading the csv documentation and the numerous other questions about this on StackOverflow. I think that £ becoming \xa3 in ASCII means that the original CSV file is in UTF-8.
(Incidentally, is there a quick way to check the encoding of a CSV file?)
If it's in UTF-8, then shouldn't the csv module be able to cope with it? It seems to be transforming all the symbols into ASCII, even though the documentation claims it accepts UTF-8.
I've tried adding a unicode_csv_reader function as described in the csv examples, but it doesn't help.
---- EDIT -----
I should clarify one thing. I have seen this question, which looks very similar. But adding the unicode_csv_reader function defined there produces a different error instead:
yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected code byte
So maybe my file isn't UTF8 after all? How can I tell?
Upvotes: 10
Views: 7351
Reputation: 82934
If you are on Windows, it is highly likely that the encoding that you should use is one of the cp125X family ... e.g. if you are in Western Europe or the Americas, it will be cp1252. Windows software often uses bytes in the range \x80 to \x9F inclusive to encode fancy punctuation characters whereas that range is reserved in ISO-8859-X for the rarely used "C1 Control Characters".
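To make the difference concrete, here is a small sketch that decodes the same bytes with both codecs. It is written for Python 3 (an assumption; the thread itself uses Python 2), and the sample bytes are made up for illustration:

```python
# Assumes Python 3, where bytes and str are distinct types.
# The pound sign sits at 0xA3 in both cp1252 and ISO-8859-1 (latin-1).
pound = b"\xa3"
same_in_both = pound.decode("cp1252") == pound.decode("latin-1")

# Bytes in the 0x80-0x9F range are where the two encodings diverge:
# cp1252 maps 0x93/0x94 to curly quotes, latin-1 maps them to C1 controls.
fancy = b"\x93quoted\x94"          # hypothetical sample data
as_cp1252 = fancy.decode("cp1252")
as_latin1 = fancy.decode("latin-1")

print(same_in_both)
print(repr(as_cp1252))
print(repr(as_latin1))
```

So for a file that only contains £ and accented letters, either codec gives the same text; the choice starts to matter once "smart" punctuation from Windows software shows up.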
You can find out the usual encoding in your locale by running this at the command line:
python -c "import locale; print locale.getpreferredencoding()"
Upvotes: 0
Reputation: 14223
Try using "ISO-8859-1" as your encoding. It seems like you are dealing with extended ASCII (a single-byte encoding), not UTF-8.
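As for telling whether the file is UTF-8 at all: one rough check is to try decoding the raw bytes and see which codecs accept them. The sketch below assumes Python 3, and `first_working_encoding` is a hypothetical helper name, not something from the csv module. It is only a heuristic, since latin-1 accepts every byte sequence and so can never be ruled out this way:

```python
# Assumes Python 3. first_working_encoding is a hypothetical helper.
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate codec that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# A lone 0xA3 byte is not valid UTF-8; in UTF-8 the pound sign is 0xC2 0xA3.
single_byte_sample = b"price,12\xa3\n"               # made-up sample row
utf8_sample = "price,12\u00a3\n".encode("utf-8")

print(first_working_encoding(single_byte_sample))
print(first_working_encoding(utf8_sample))
```

Trying UTF-8 first is deliberate: text in a single-byte encoding that contains non-ASCII characters rarely happens to be valid UTF-8, so a successful UTF-8 decode is a fairly strong hint, while a failure tells you to look at the cp125X or ISO-8859 families instead.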
Edit:
Here's some simple code that deals with extended ASCII:
>>> s = "La Pe\xf1a"
>>> print s
La Pe±a
>>> print s.decode("latin-1")
La Peña
>>>
Even better, dealing with the exact character that is giving you problems:
>>> s = "12\xa3"
>>> print s.decode("latin-1")
12£
>>>
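Applied to the csv case, the same idea might look like the sketch below. It assumes Python 3, where the csv module works on text and the decoding happens when the file is opened; the in-memory bytes stand in for the real file, and latin-1 is the guess from this answer rather than a confirmed fact about the asker's data:

```python
import csv
import io

# Assumes Python 3. The bytes below are a made-up stand-in for the CSV file.
raw = b"item,price\nwidget, 12\xa3 \n"

# Decode while reading; newline="" lets the csv module handle line endings.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="latin-1", newline="")

# Each cell is already a str, so stripping needs no further decoding.
rows = [[cell.strip() for cell in row] for row in csv.reader(text_stream)]
print(rows)
```

With a file on disk, the equivalent would be `open(path, encoding="latin-1", newline="")`, where `path` is whatever your file is called.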
Upvotes: 7