Reputation: 4119
Recently I have been having trouble opening specific UTF-16 encoded files in Python. I have tried the following:
import codecs
f = codecs.open('filename.data', 'r', 'utf-16-be')
contents = f.read()
but I get the following error:
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 18-19: illegal UTF-16 surrogate
after trying to read the contents of the file. I have tried forcing little-endian as well, but that's no good. The file header is as follows:
0x FE FF EE FF
Which I have read denotes UTF-16 Big Endian. I have been able to read the contents of the file into a raw string by using the following:
import binascii
f = open('filename.data', 'rb')
raw = f.read()
hex_data = binascii.hexlify(raw)  # renamed so it does not shadow the built-in hex()
That gets me the raw hex, but these files are sometimes little-endian and sometimes big-endian, so I essentially want to normalize the data before I start parsing. I was hoping codecs could help with that, but no luck.
Does anyone have an idea of what's going on here? I would provide the file(s) as reference but there is some sensitive data so unfortunately I can't. This file is used by Windows OS.
My end goal, as I mentioned above, is to be able to open/read these files and normalize them so that I can use the same parser for all of them, rather than having to write a few parsers with a bunch of error handling in case the encoding is wacky.
EDIT: As requested, the first 32 bytes of the file:
FE FF EE FF 11 22 00 00 03 00 00 00 01 00 00 00
92 EC DA 48 1B 00 00 00 63 00 3A 00 5C 00 77 00
Upvotes: 7
Views: 21809
Reputation: 42748
Looks like you have a header of 24 binary bytes before your utf16-encoded string starts. So you can read the file as binary and decode afterwards:
with open(filename, "rb") as data:
    header = data.read(24)
    text = data.read().decode('utf-16-le')
But there are probably other binary parts as well. Without knowing the exact file format, it is hard to give more specific help.
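As a rough check of the idea above, here is a minimal sketch that runs against the 32 bytes posted in the question rather than the real file. The 24-byte header size is the assumption made in this answer, and `split_header` is a hypothetical helper name, not part of any library. It shows that decoding the whole buffer as UTF-16 big-endian fails, while the bytes after the assumed header decode cleanly as UTF-16 little-endian.

```python
# First 32 bytes of the file, copied from the hex dump in the question.
sample = bytes.fromhex(
    "FE FF EE FF 11 22 00 00 03 00 00 00 01 00 00 00 "
    "92 EC DA 48 1B 00 00 00 63 00 3A 00 5C 00 77 00"
)

HEADER_SIZE = 24  # assumption: size of the binary header, as guessed above


def split_header(data, header_size=HEADER_SIZE):
    """Hypothetical helper: return (raw header bytes, decoded text)."""
    header = data[:header_size]
    text = data[header_size:].decode("utf-16-le")
    return header, text


# Decoding everything as big-endian trips over the binary header bytes.
try:
    sample.decode("utf-16-be")
    be_failed = False
except UnicodeDecodeError:
    be_failed = True

header, text = split_header(sample)
print(be_failed)
print(repr(text))
```

On the sample, the text part starts with a Windows-style path, which supports the guess that the payload is little-endian even though the first two bytes look like a big-endian BOM. For the real file, the same helper could be applied to the result of reading the file in "rb" mode, keeping in mind that later binary sections may still break the decode.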
Upvotes: 4