pazzdzioh

Reputation: 71

UnicodeDecodeError when parsing a file with a for loop in Python 3

I get a UnicodeDecodeError when I loop over the lines of a file.

with open(somefile, 'r') as f:
    for line in f:
        pass  # do something with line

This happens with Python 3.4. I have some files that contain non-UTF-8 characters. I want to parse each file line by line, find the lines where the problem appears, and get the exact index within the line where the non-UTF-8 bytes occur. I already have code for this that works under Python 2.7.9, but under Python 3.4 I get a UnicodeDecodeError when the for loop is executed. Any ideas?

Upvotes: 2

Views: 1367

Answers (1)

Robᵩ

Reputation: 168626

You need to open the file in binary mode and decode the lines one at a time. Try this:

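# Binary mode yields raw bytes, so iterating over the file cannot raise.
with open('badutf.txt', 'rb') as f:
    for i, line in enumerate(f, 1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as e:
            # e.start is the byte offset of the first invalid byte in this line.
            print('Line: {}, Offset: {}, {}'.format(i, e.start, e.reason))

Here is the result I get in Python 3: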

Line: 16, Offset: 6, invalid start byte

Sure enough, the bad byte is on line 16 at offset 6 (e.start counts bytes from 0 within the line).
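
If you want to collect the positions instead of printing them, the same idea can be wrapped in a function. This is only a rough sketch: the name find_bad_utf8_lines and the temporary sample file are made up for illustration, and it reports just the first invalid byte on each line, because decode() stops at the first error.

```python
import os
import tempfile


def find_bad_utf8_lines(path):
    """Return (line_number, byte_offset, reason) for each line that fails to decode.

    Line numbers are 1-based; byte offsets are 0-based within the line.
    Only the first invalid byte on each line is reported.
    """
    problems = []
    with open(path, 'rb') as f:
        for line_number, raw_line in enumerate(f, 1):
            try:
                raw_line.decode('utf-8')
            except UnicodeDecodeError as e:
                problems.append((line_number, e.start, e.reason))
    return problems


# Hypothetical sample data: line 2 has the stray byte 0xff at offset 3.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'good line\n' + b'bad\xffline\n' + b'also good\n')
    sample_path = tmp.name

results = find_bad_utf8_lines(sample_path)
os.remove(sample_path)
print(results)
```

Returning a list rather than printing makes it easier to reuse the positions later, for example to slice out the offending bytes from each line.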

Upvotes: 2
