Reputation: 1833
The problem:
What I tried:
I started by using chardet, but the application took a major performance hit because my code read the entire file into RAM before detecting its encoding. I then thought I could feed just some representative data to chardet's detect method, but realized a sample could easily miss the offending character(s): e.g. a single '®' byte will break decoding of a text file that otherwise decodes as UTF-8 just fine. To avoid taking this hit unless I have to, I went this route:
import codecs
import chardet

def get_file_handle(self):
    """
    Default encoding is UTF-8. If that fails, try Western European (Windows-1252), else use chardet to detect
    :return: file handle (f)
    """
    try:
        return codecs.open(self.current_file, mode='rb', encoding='utf-8')
    except UnicodeDecodeError:
        try:
            return codecs.open(self.current_file, mode='rb', encoding='cp1252')
        except UnicodeDecodeError:
            # read raw data and detect encoding via chardet (last resort)
            with open(self.current_file, 'rb') as fh:  # binary mode: chardet needs bytes
                raw_data = fh.read()
            result = chardet.detect(raw_data)
            char_enc = result['encoding']
            return codecs.open(self.current_file, mode='rb', encoding=char_enc)
While this works, in the rare event it reaches the third/innermost exception, it is still reading the entire file into RAM. Simply reading some random representative data may miss the offending character(s) in a text document. Here's what I'd like to do:
When I get a UnicodeDecodeError, the final line of the traceback is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043: invalid start byte
I'd like to get the byte offset (2867043 in the example above) and then grab 1,000 bytes before and after it from the file to feed to chardet for detection, thus including the offending character plus additional data to base the encoding prediction on.
I already know how to read the data in chunks (but feel free to add this too); primarily I am interested in how to obtain that byte offset from the traceback.
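For reference, a minimal snippet showing that the `UnicodeDecodeError` object itself carries the offset as attributes, so no traceback parsing is needed:

```python
# Hypothetical demo: decode a byte string containing a stray 0xAE byte.
try:
    b"abc\xaedef".decode("utf-8")
except UnicodeDecodeError as e:
    print(e.encoding)  # 'utf-8'
    print(e.start)     # 3 -- byte offset of the first undecodable byte
    print(e.end)       # 4
    print(e.reason)    # 'invalid start byte'
```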
Upvotes: 1
Views: 396
Reputation: 36033
How about this:
except UnicodeDecodeError as e:
    # read raw data and detect encoding via chardet (last resort)
    with open(self.current_file, 'rb') as f:   # binary mode: chardet expects bytes
        f.seek(max(0, e.start - 1000))         # clamp so we never seek before offset 0
        raw_data = f.read(2000)
    result = chardet.detect(raw_data)
    ...
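As a sketch, the window extraction can be pulled into a small helper (`bytes_around_error` is a hypothetical name; binary mode matters because `chardet.detect` expects bytes, and the clamp avoids seeking before the start of the file):

```python
def bytes_around_error(path, err, window=1000):
    """Return up to 2*window bytes centred on err.start, the offset of
    the first undecodable byte reported by a UnicodeDecodeError."""
    start = max(0, err.start - window)  # clamp: no negative seek
    with open(path, 'rb') as f:         # binary mode: chardet expects bytes
        f.seek(start)
        return f.read(2 * window)
```

Feeding the returned bytes to `chardet.detect(...)` then gives an encoding guess based on data surrounding the offending byte rather than the whole file.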
Upvotes: 1