Jarede

Reputation: 3488

How to open mixed-encoded Unicode files in Python 3.6?

I receive files from an undocumented resource that can contain data that looks like:

16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4
#recordsWritten:1293462

The above is just an example; the files I'm working with contain all kinds of different languages (and thus encodings). I'm opening the file with Python 3.6 (an inherited code base that I've ported from Python 2 to Python 3) using the following code:

import os

f = open(file_path, "r")

f.seek(0, os.SEEK_END)
f.seek(f.tell() - 40, os.SEEK_SET)
records_str = f.read()
print(records_str)

Using this code, I receive a: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte

if I change it to include an encoding:

f = open(file_path, "r", encoding='utf-8'), I receive the same error.

Changing the encoding to utf-16 results in it printing:

랂菣Ꚃ菣Ɩȴ⌊敲潣摲坳楲瑴湥ㄺ㤲㐳㈶ਂ

Which appears to be wrong.

Switching it to open the file in binary mode: f = open(file_path, "rb") results in it outputting:

b'\x82\xb7\xe3\x83\xa5\xe3\x82\xa6\xe3\x83\x96\x014\x02\n#recordsWritten:1293462\x02\n'

Now this is slightly better, however, when I eventually come to processing the file, I don't want to be adding \x82\xb7\xe3\x83\xa5\ to my database, I'd rather add the ガガスゴイヘンシ. So, is there a way to handle Unicode encoded files? I've also looked at the Mozilla chardet project to try and determine encoding, but following code examples, it thinks the file is utf-8 encoded.

Upvotes: 2

Views: 393

Answers (2)

tripleee

Reputation: 189337

If you seek into the middle of a UTF-8 sequence, the error message doesn't necessarily mean the data isn't actually UTF-8, just that you can't seek to that exact position and get a useful decoding. "Invalid start byte" means this cannot be the beginning of a valid UTF-8 string.

If you only need to retrieve the last line of the file, maybe just read the entire file and pluck off the last line, or use try/except until you find a position you can safely seek to. Or simply read part or all of the file as bytes and then decode only the last line.

import os

with open(file_path, "rb") as f:  # notice "b" in "rb"
    f.seek(0, os.SEEK_END)
    f.seek(f.tell() - 40, os.SEEK_SET)
    records_bytes = f.read()
records_str = records_bytes.split(b'\n')[-2].decode('ascii')
print(records_str)

We use [-2] on the assumption that the file ends with a final newline (i.e. it is a well-formed text file), so [-1] is simply an empty string and [-2] is the last actual line.
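As an alternative to the try/except scan mentioned above, you can exploit the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx (0x80-0xBF): back up a fixed number of bytes, skip forward past any continuation bytes, and decode from the first character boundary. A minimal sketch, assuming the tail really is UTF-8 (the helper name read_tail and the 40-byte window are illustrative, not from the original code):

```python
import os

def read_tail(file_path, size=40):
    """Return roughly the last `size` bytes of a UTF-8 file as a str,
    trimming any partial character at the front."""
    with open(file_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        f.seek(max(f.tell() - size, 0), os.SEEK_SET)
        tail = f.read()
    # UTF-8 continuation bytes fall in 0x80-0xBF; skip any at the
    # front so decoding starts on a character boundary.
    start = 0
    while start < len(tail) and 0x80 <= tail[start] <= 0xBF:
        start += 1
    return tail[start:].decode("utf-8")
```

This avoids the UnicodeDecodeError entirely, at the cost of losing at most one partial character at the start of the window.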

(Posting this as a separate answer so as not to pollute my other answer, which I hope might also be more useful to future visitors.)

Upvotes: 1

tripleee

Reputation: 189337

Without knowledge of the actual bytes in the file, all we can do is speculate.

If the file is not using a single encoding throughout, there is really no way to process it programmatically. You will have to divide it into sections and separately convert each one using whatever encoding is correct for that sequence. This will almost certainly require manual work, if only to establish the boundaries between sections with different encodings.

Going forward, you will probably want to convert everything to a single encoding; my recommendation for that would be UTF-8. It should be able to accommodate anything you can get Python to recognize as a valid string in the first place.

As a crude example, if you know the example you provided uses plain 7-bit ASCII for the Latin sections and EUC-JP for the Japanese characters, maybe try

with open(filename, 'rb') as filebytes:
    raw_bytes = filebytes.read()
string = raw_bytes[0:26].decode('ascii') + \
    raw_bytes[26:54].decode('euc-jp') + \
    raw_bytes[54:].decode('ascii')

I determined the character ranges experimentally from the string you provided; if I guessed wrong which encoding you used for the Japanese text (in particular) they are probably not correct for your actual data.

Observe how we can read bytes from a filehandle opened with rb and Python will not try to apply any character encoding while reading them. But then of course we have to decode them separately with the correct encoding for each if we want to turn this into a string.
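And once every section decodes cleanly, normalizing the result to a single encoding on disk is one extra step — a sketch, where the string stands in for the per-section decodes above and the output filename is a placeholder:

```python
# `decoded_text` stands in for the str assembled from the per-section
# decodes above; opening with encoding='utf-8' normalizes the file.
decoded_text = "16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4"

with open("normalized.txt", "w", encoding="utf-8") as out:
    out.write(decoded_text)

# Reading it back now needs no per-section juggling.
with open("normalized.txt", encoding="utf-8") as back:
    assert back.read() == decoded_text
```

After this one-time conversion, the plain open(file_path, "r", encoding="utf-8") from the question works as expected.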

Upvotes: 1
