UnicodeDecodeError: 'utf8' in Python 2.7

Question

I have a large file with many lines. Most of the lines are UTF-8, but it looks like a few of them are not. When I try to read the lines with code like this:

    import codecs

    in_file = codecs.open(source, "r", "utf-8")
    for line in in_file:
        pass  # SOME OPERATIONS

I get the following error:

    for line in in_file:
  File "C:\Python27\lib\codecs.py", line 681, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 612, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 527, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte

What I would like to do is skip the lines that are not UTF-8 without breaking the code, then go on to the next line in the file and do my operations. How can I do this with try and except?

Ulrich Eckhardt · Accepted Answer

Open the file without any codec. Then, read the file line-by-line and try to decode each line from UTF-8. If that raises an exception, skip the line.
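A minimal sketch of this approach, assuming the goal is simply to drop any line that fails to decode. The helper name iter_utf8_lines is hypothetical, and io.open is used only so the same code runs on Python 2.7 and Python 3:

```python
import io


def iter_utf8_lines(path):
    """Yield only the lines of `path` that decode cleanly as UTF-8.

    `iter_utf8_lines` is a hypothetical helper name used for illustration.
    """
    # Binary mode: no codec runs while iterating, so the loop itself
    # can never raise UnicodeDecodeError.
    with io.open(path, "rb") as in_file:
        for raw_line in in_file:
            try:
                line = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                # This line is not valid UTF-8: skip it and move on.
                continue
            yield line
```

It can then replace the original loop, for example: for line in iter_utf8_lines(source): followed by the operations. Each yielded line keeps its line ending, and because the file is read in binary mode, Windows line endings stay as "\r\n" rather than being translated to "\n".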

A completely different approach would be to tell the codec to replace or ignore bytes it cannot decode, by passing an error handler such as "replace" or "ignore" as the errors argument when opening the file. This doesn't skip the lines, but you don't seem to care too much about the contained data anyway, so it might be an alternative.
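A short sketch of the error-handler alternative, under the assumption that keeping a damaged line is acceptable. The helper name iter_lines_lenient is hypothetical; io.open is used here for Python 2.7 and 3 compatibility, and codecs.open accepts the same errors value as its fourth argument:

```python
import io


def iter_lines_lenient(path, errors="replace"):
    """Yield every line of `path`, decoding as UTF-8 with an error handler.

    `iter_lines_lenient` is a hypothetical helper name used for illustration.
    errors="replace" turns each undecodable byte into U+FFFD;
    errors="ignore" silently drops it.
    """
    with io.open(path, "r", encoding="utf-8", errors=errors) as in_file:
        for line in in_file:
            yield line
```

With "replace" the bad bytes stay visible as U+FFFD replacement characters, which makes damaged lines easy to spot later. With "ignore" they vanish without a trace, so the line may look valid even though data was lost.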
