Bob Dylan

Reputation: 1833

Get byte offset of error from Python UnicodeDecodeError exception

The problem:

What I tried:

I started with chardet, but the application took a major performance hit because it loaded the entire file into RAM before detecting the encoding. I then considered passing only a representative sample of the data to chardet's detect method, but realized that a sample could easily miss a stray character that causes problems (e.g. a single '®' character will break a text file that otherwise decodes as UTF-8 just fine). To avoid taking this hit unless I have to, I went this route:

def get_file_handle(self):
    """
    Default encoding is UTF-8. If that fails, try Western European (Windows-1252), else use chardet to detect
    :return: file handle (f)
    """
    try:
        with codecs.open(self.current_file, mode='rb', encoding='utf-8') as f:
            return f
    except UnicodeDecodeError:
        try:
            with codecs.open(self.current_file, mode='rb', encoding='cp1252') as f:
                return f
        except UnicodeDecodeError:
            # read raw data and detect encoding via chardet (last resort)
            raw_data = open(self.current_file, 'r').read()
            result = chardet.detect(raw_data)
            char_enc = result['encoding']
            with codecs.open(self.current_file, mode='rb', encoding=char_enc) as f:
                return f

While this works, in the rare event it reaches the third/innermost exception, it is still reading the entire file into RAM. Simply reading some random representative data may miss the offending character(s) in a text document. Here's what I'd like to do:

I already know how to read the data in chunks (but feel free to add this too); primarily, I am interested in how to obtain that byte offset from the traceback.

Upvotes: 1

Views: 396

Answers (1)

Alex Hall

Reputation: 36033

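How about this? UnicodeDecodeError has a start attribute holding the index of the first undecodable byte in the data that was being decoded, so you can read a window around it:

except UnicodeDecodeError as e:
    # read raw data and detect encoding via chardet (last resort)
    # binary mode: chardet works on raw bytes, and byte offsets only make sense on bytes
    with open(self.current_file, 'rb') as f:
        # clamp at 0 so an error near the start of the file does not seek to a negative offset
        f.seek(max(e.start - 1000, 0))
        raw_data = f.read(2000)
        result = chardet.detect(raw_data)
        ...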
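
One caveat with e.start: it is an index into the bytes that were being decoded (e.object), which matches the file offset only if the whole file was decoded in a single call. If you decode in chunks, the index has to be translated into a file position. Below is a minimal sketch of that, assuming Python 3. The names find_decode_error_offset and read_window are hypothetical helpers, not part of codecs or chardet, and the offset arithmetic assumes e.object is the decoder's leftover bytes plus the current chunk, which is how the standard buffered incremental decoders behave as far as I know.

```python
import codecs


def find_decode_error_offset(path, encoding="utf-8", chunk_size=64 * 1024):
    """Return the absolute byte offset of the first undecodable byte, or None.

    Hypothetical helper: decodes the file chunk by chunk so the whole file
    never has to sit in RAM.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    fed = 0  # total number of bytes handed to the decoder so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            fed += len(chunk)
            try:
                # An empty chunk means EOF; final=True then flushes any
                # truncated multi-byte sequence still held in the decoder.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as e:
                # Assumption: e.object is (leftover bytes + this chunk), so it
                # ends at file position `fed`, and e.start indexes into it.
                return fed - len(e.object) + e.start
            if not chunk:
                return None


def read_window(path, offset, radius=1000):
    """Return up to `radius` bytes on each side of `offset` (hypothetical helper)."""
    start = max(offset - radius, 0)  # never seek before the start of the file
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(offset - start + radius)
```

The bytes returned by read_window can then be passed to chardet.detect in place of the whole file.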

Upvotes: 1
