Reputation: 371
I want to read a CSV file and process some columns, but I keep running into problems. I am stuck on the following error:
Traceback (most recent call last):
  File "C:\Users\Sven\Desktop\Python\read csv.py", line 5, in <module>
    for row in reader:
  File "C:\Python34\lib\codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 446: invalid start byte
>>>
My Code
import csv
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv",newline='', encoding="utf8") as f:
    reader = csv.reader(f,delimiter=';',quotechar='|')
    #print(sum(1 for row in reader))
    for row in reader:
        print(row)
        if row:
            value = row[6]
            value = value.replace('(', '')
            value = value.replace(')', '')
            value = value.replace(' ', '')
            value = value.replace('.', '')
            value = value.replace('0032', '0')
            if len(value) > 0:
                print(value + ' Length: ' + str(len(value)))
I'm a beginner with Python. I tried googling, but it is hard to find the right solution.
Can anyone help me out?
Upvotes: 14
Views: 34832
Reputation: 371
I was getting a similar error when trying to read or upload the following kinds of files:
The best way to avoid an error like:
is to read these files as bytes. When you open a file in binary mode, you do not need to provide an encoding. So when you open them, specify:
with open(file_path, 'rb') as file:
Or in your case, the code should be something like:
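import csv
import io
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", 'rb') as f:
    # csv.reader needs text in Python 3, so wrap the byte stream; errors='replace' stops bytes like 0x89 from raising
    text = io.TextIOWrapper(f, encoding='utf-8', errors='replace', newline='')
    reader = csv.reader(text, delimiter=';', quotechar='|')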
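Reading as bytes is also a handy way to look at what is actually sitting near the offset in the traceback. Here is a minimal sketch; `bytes_around` is a hypothetical helper name, and it assumes the reported position (446) corresponds to the file offset, which should hold when the error occurs in the first chunk read from the file.

```python
def bytes_around(path, position, context=16):
    """Return the raw bytes surrounding `position` without decoding them."""
    start = max(0, position - context)
    with open(path, 'rb') as f:  # binary mode: no encoding or newline argument
        f.seek(start)
        return f.read(position - start + context)

# Example call (the path is a placeholder; 446 is the offset from the traceback):
# print(bytes_around("relaties 24112014.csv", 446))
```

The printed `bytes` object shows unprintable values as `\x..` escapes, so the byte that trips the decoder and its neighbours are easy to spot.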
Upvotes: 4
Reputation: 3885
The first byte of a .PNG file is 0x89. Not saying that is your problem, but the .PNG header is specifically designed so that it is NOT accidentally interpreted as text.
Why you would have a .csv file that is actually a .png, I don't know. But it definitely could happen if someone renamed the file by mistake. On Windows 10, every once in a while I mass-rename files by accident because of their stupid checkbox feature. Why Microsoft decided that giving desktop machines the same UI controls as tablets was a good idea... I don't know.
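If you want to rule this out quickly, you can compare the start of the file against the 8-byte PNG signature. This is only a sketch; `looks_like_png` is a made-up helper name, and it says nothing about files where the odd byte shows up later on.

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # the 8-byte signature that starts every PNG file

def looks_like_png(path):
    """Return True if the file begins with the PNG signature."""
    with open(path, 'rb') as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE

# Example call (the path is a placeholder):
# print(looks_like_png("relaties 24112014.csv"))
```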
Upvotes: 8
Reputation: 391
This is the most important clue:
invalid start byte
\x89 is not, as suggested in the comments, an invalid UTF-8 byte. It is a completely valid continuation byte. That is, when it follows a suitable lead byte, the pair is valid UTF-8:
http://hexutf8.com/?q=0xc90x89
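The same thing can be checked from Python. A small sketch (the helper name `utf8_error_reason` is mine, not part of any library):

```python
# 0x89 has the bit pattern 10xxxxxx, which marks a UTF-8 continuation byte.
assert 0x89 & 0b11000000 == 0b10000000

# After a two-byte lead such as 0xC9 it decodes cleanly to one character.
pair = b'\xc9\x89'.decode('utf-8')
print(hex(ord(pair)))  # 0x249

def utf8_error_reason(data):
    """Return the decoder's complaint for `data`, or None if it is valid UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.reason
    return None

# On its own, the same byte sits where a start byte is expected.
print(utf8_error_reason(b'\x89'))  # invalid start byte
```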
So either you (1) do not have UTF-8 data as you expect, or (2) you have some malformed UTF-8 data. The Python codec is simply letting you know that it encountered \x89 at a position where a start byte was expected.
(More on continuation bytes here: http://en.wikipedia.org/wiki/UTF-8#Codepage_layout)
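For case (1), one rough way to explore is to decode a sample of the raw bytes with a few candidate encodings and look at the results. The candidates below are only guesses on my part, not something the traceback tells you, and `try_encodings` is a hypothetical helper; a decode that succeeds is not proof that the encoding is the right one.

```python
def try_encodings(data, candidates=('utf-8', 'cp1252', 'mac_roman')):
    """Decode `data` with each candidate; map failures to None."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Example with a short sample that contains the offending byte:
for name, text in try_encodings(b'caf\x89').items():
    print(name, repr(text))
```

Whichever candidate gives readable text for your real data is the value to pass as encoding= when opening the file.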
Upvotes: 7