Reputation: 10533
First time doing Python in a while, and I'm having trouble doing a simple scan of a file when I run the following script with Python 3.0.1,
with open("/usr/share/dict/words", 'r') as f:
for line in f:
pass
I get this exception:
Traceback (most recent call last):
File "/home/matt/install/test.py", line 2, in <module>
for line in f:
File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
line = self.readline()
File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
while self._read_chunk():
File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
output = self.decoder.decode(input, final=final)
File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data
The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.
Update: I added,
encoding="iso-8559-1"
to the open() call, and it fixed the problem.
Upvotes: 0
Views: 2028
Reputation: 284927
Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:
iconv -f UTF-8 /usr/share/dict/words -o /dev/null
There are other ways to do the same thing.
Upvotes: 1
Reputation: 83002
How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?
Try this at the interactive prompt:
buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
buf[offset-50:60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
Upvotes: 1