Jarede

Reputation: 3488

How to open mixed-encoded Unicode files in Python 3.6?

I receive files from an undocumented resource that can contain data that looks like:

16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4
#recordsWritten:1293462

The above is just an example; the files I'm working with contain all kinds of different languages (and thus encodings). I'm opening the file with Python 3.6 (an inherited code base that I've ported from Python 2 to Python 3) using the following code:

import os

f = open(file_path, "r")

f.seek(0, os.SEEK_END)
f.seek(f.tell() - 40, os.SEEK_SET)
records_str = f.read()
print(records_str)

Using this code, I receive a: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte

if I change it to include an encoding:

f = open(file_path, "r", encoding='utf-8'), I receive the same error.

Changing the encoding to utf-16 results in it printing:

랂菣Ꚃ菣Ɩȴ⌊敲潣摲坳楲瑴湥ㄺ㤲㐳㈶ਂ

Which appears to be wrong.

Switching it to open the file in binary mode: f = open(file_path, "rb") results in it outputting:

b'\x82\xb7\xe3\x83\xa5\xe3\x82\xa6\xe3\x83\x96\x014\x02\n#recordsWritten:1293462\x02\n'

Now this is slightly better, however, when I eventually come to processing the file, I don't want to be adding \x82\xb7\xe3\x83\xa5\ to my database, I'd rather add the ガガスゴイヘンシ. So, is there a way to handle Unicode encoded files? I've also looked at the Mozilla chardet project to try and determine encoding, but following code examples, it thinks the file is utf-8 encoded.

Upvotes: 2

Views: 393

Answers (2)

tripleee

Reputation: 189337

If you seek into the middle of a UTF-8 sequence, the error message doesn't necessarily mean the data isn't actually UTF-8, just that you can't seek to that exact position and get a useful decoding. "Invalid start byte" means this cannot be the beginning of a valid UTF-8 string.

If you only need to retrieve the last line of the file, maybe just read the entire file and pluck off the last line, or use try/except until you find a position you can safely seek to. Or simply read part or all of the file as bytes and then decode only the last line.

import os

with open(file_path, "rb") as f:  # notice "b" in "rb"
    f.seek(0, os.SEEK_END)
    f.seek(f.tell() - 40, os.SEEK_SET)
    records_bytes = f.read()
records_str = records_bytes.split(b'\n')[-2].decode('ascii')
print(records_str)

We use [-2] on the assumption that the file ends with a final newline (i.e. it is a well-formed text file), so [-1] is simply an empty string and [-2] is the last actual line.
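As an alternative to the try/except scan mentioned above, you can exploit the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx (0x80-0xBF): back up a fixed number of bytes, skip forward past any continuation bytes, and decode from the first character boundary. A minimal sketch, assuming the tail really is UTF-8 (the helper name read_tail and the 40-byte window are illustrative, not from the original code):

```python
import os

def read_tail(file_path, size=40):
    """Return roughly the last `size` bytes of a UTF-8 file as a str,
    trimming any partial character at the front."""
    with open(file_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        f.seek(max(f.tell() - size, 0), os.SEEK_SET)
        tail = f.read()
    # UTF-8 continuation bytes fall in 0x80-0xBF; skip any at the
    # front so decoding starts on a character boundary.
    start = 0
    while start < len(tail) and 0x80 <= tail[start] <= 0xBF:
        start += 1
    return tail[start:].decode("utf-8")
```

This avoids the UnicodeDecodeError entirely, at the cost of losing at most one partial character at the start of the window.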

(Posting this as a separate answer so as not to pollute my other answer, which I hope might also be more useful to future visitors.)

Upvotes: 1

tripleee

Reputation: 189337

Without knowledge of the actual bytes in the file, all we can do is speculate.

If the file is not using a single encoding throughout, there is really no way to process it programmatically. You will have to divide it into sections and separately convert each one using whatever encoding is correct for that sequence. This will almost certainly require manual work, if only to establish the boundaries between sections with different encodings.

Going forward, you will probably want to convert everything to a single encoding; my recommendation for that would be UTF-8. It should be able to accommodate anything you can get Python to recognize as a valid string in the first place.

As a crude example, if you know the example you provided uses plain 7-bit ASCII for the Latin sections and EUC-JP for the Japanese characters, maybe try

with open(filename, 'rb') as filebytes:
    raw_bytes = filebytes.read()
string = raw_bytes[0:26].decode('ascii') + \
    raw_bytes[26:54].decode('euc-jp') + \
    raw_bytes[54:].decode('ascii')

I determined the character ranges experimentally from the string you provided; if I guessed wrong which encoding you used for the Japanese text (in particular) they are probably not correct for your actual data.

Observe how we can read bytes from a filehandle opened with rb and Python will not try to apply any character encoding while reading them. But then of course we have to decode them separately with the correct encoding for each if we want to turn this into a string.
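And once every section decodes cleanly, normalizing the result to a single encoding on disk is one extra step — a sketch, where the string stands in for the per-section decodes above and the output filename is a placeholder:

```python
# `decoded_text` stands in for the str assembled from the per-section
# decodes above; opening with encoding='utf-8' normalizes the file.
decoded_text = "16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4"

with open("normalized.txt", "w", encoding="utf-8") as out:
    out.write(decoded_text)

# Reading it back now needs no per-section juggling.
with open("normalized.txt", encoding="utf-8") as back:
    assert back.read() == decoded_text
```

After this one-time conversion, the plain open(file_path, "r", encoding="utf-8") from the question works as expected.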

Upvotes: 1
