Matt R

Reputation: 10533

UnicodeDecodeError when reading dictionary words file with simple Python script

This is my first time using Python in a while, and I'm having trouble doing a simple scan of a file. When I run the following script with Python 3.0.1:

with open("/usr/share/dict/words", 'r') as f:
   for line in f:
       pass

I get this exception:

Traceback (most recent call last):
  File "/home/matt/install/test.py", line 2, in <module>
    for line in f:
  File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
    line = self.readline()
  File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
    while self._read_chunk():
  File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data

The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.

Update: I added

encoding="iso-8859-1"

to the open() call, and it fixed the problem.
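For anyone landing here later, a minimal sketch of the scan with the encoding passed explicitly. It assumes the word list really is ISO-8859-1 (Latin-1); `count_lines` and its parameters are names made up for illustration, not part of the original script.

```python
def count_lines(path, encoding="iso-8859-1"):
    """Scan a text file line by line, decoding with the given encoding.

    The default encoding is an assumption about the file, not a fact
    established by the traceback.
    """
    count = 0
    with open(path, "r", encoding=encoding) as f:
        for _ in f:
            count += 1
    return count
```

Calling `count_lines("/usr/share/dict/words")` reproduces the original loop. Note that Latin-1 maps every possible byte to a character, so decoding with it never raises; the error going away does not by itself prove the file is Latin-1.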

Upvotes: 0

Views: 2028

Answers (2)

Matthew Flaschen

Reputation: 284927

Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:

iconv -f UTF-8 /usr/share/dict/words -o /dev/null

There are other ways to do the same thing.
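One such alternative, sketched in Python in case iconv isn't handy. It feeds the file through the standard library's incremental UTF-8 decoder in binary chunks and reports whether anything failed to decode. `is_valid_utf8` and `chunk_size` are illustrative names, and this is only a rough equivalent of the iconv check, not a replacement for it.

```python
import codecs


def is_valid_utf8(path, chunk_size=65536):
    """Return True if every byte of the file decodes as strict UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # A multi-byte sequence split across chunks is buffered,
                # not rejected, until the next call.
                decoder.decode(chunk)
        # Flush: raises if the file ends in the middle of a sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

If `is_valid_utf8("/usr/share/dict/words")` returns False, the file is in some other encoding and the UTF-8 default is what's tripping up the script.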

Upvotes: 1

John Machin

Reputation: 83002

How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?

Try this at the interactive prompt:

buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but the error message will give you the byte offset into the file
# set offset to the start position reported in that message, then:
buf[max(offset-50, 0):offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
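The same idea can be wrapped in a small function so you don't have to copy the offset by hand. This is a sketch, not a polished tool: `first_bad_offset` and `context` are names I've made up, and it relies on the `start` and `end` attributes that `UnicodeDecodeError` carries.

```python
def first_bad_offset(data, context=50):
    """Find the first byte in `data` that is not valid UTF-8.

    Returns (offset, surrounding_bytes), or None if `data` decodes cleanly.
    Because the whole buffer is decoded in one call, the offset is relative
    to the start of `data`, not to an internal read chunk.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return exc.start, data[lo:exc.end + context]
    return None
```

Pass it the `buf` read in binary mode above; the returned bytes show the words on either side of the offending byte, which should settle which line is actually at fault.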

Upvotes: 1
