Reputation: 8520
I have a large file with many lines. Most of the lines are UTF-8, but it looks like a few of them are not. When I try to read the lines with code like this:
import codecs

in_file = codecs.open(source, "r", "utf-8")
for line in in_file:
    pass  # SOME OPERATIONS
I get the following error:
    for line in in_file:
  File "C:\Python27\lib\codecs.py", line 681, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 612, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 527, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte
For lines that are not valid UTF-8, I would like to do nothing and not have the program crash, then move on to the next line in the file and run my operations. How can I do this with try and except?
Upvotes: 0
Views: 542
Reputation: 17444
Open the file without any codec. Then, read the file line-by-line and try to decode each line from UTF-8. If that raises an exception, skip the line.
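A minimal sketch of that approach, assuming the lines are separated by "\n" (in UTF-8 the newline byte never occurs inside a multi-byte sequence, so splitting the raw bytes on it is safe). The helper name iter_utf8_lines is made up for illustration, and io.open is used so that the same code runs on Python 2.7 and Python 3:

```python
import io


def iter_utf8_lines(path):
    """Yield every line of `path` that decodes cleanly as UTF-8; skip the rest.

    Hypothetical helper, not part of any library.
    """
    # Binary mode: no codec is involved, so iterating over the file
    # cannot raise UnicodeDecodeError by itself.
    with io.open(path, "rb") as in_file:
        for raw_line in in_file:
            try:
                line = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                # Not valid UTF-8: do nothing and go on to the next line.
                continue
            yield line
```

Your operations then go in an ordinary loop, for example `for line in iter_utf8_lines(source): ...`. Each yielded line still ends with its newline character, just as it would when iterating over the file directly.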
A completely different approach would be to tell the codec to replace or ignore faulty characters. This doesn't skip the lines but you don't seem to care too much about the contained data anyway, so it might be an alternative.
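As a rough sketch of this second approach: the errors argument chooses what the decoder does with bytes it cannot decode. "ignore" drops them silently, and "replace" substitutes U+FFFD for them. The function name read_lines_lenient is invented for this example. It uses io.open, but codecs.open accepts the same errors argument, so the original call could be kept and just given errors="ignore".

```python
import io


def read_lines_lenient(path, errors="replace"):
    """Yield decoded lines from `path`, tolerating invalid UTF-8 bytes.

    Hypothetical helper. `errors` is passed straight to the decoder:
    "replace" turns undecodable bytes into U+FFFD, "ignore" drops them.
    """
    with io.open(path, "r", encoding="utf-8", errors=errors) as in_file:
        for line in in_file:
            yield line
```

Note that no line is skipped here. A damaged line still reaches your operations, only with the bad bytes removed or replaced, so this fits only if slightly altered text is acceptable.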
Upvotes: 1