DJMcCarthy12

Reputation: 4119

Opening and reading UTF-16 files in Python

Recently I have been having trouble opening specific UTF-16 encoded files in Python. I have tried the following:

import codecs
f = codecs.open('filename.data', 'r', 'utf-16-be')
contents = f.read()

but I get the following error:

UnicodeDecodeError: 'utf16' codec can't decode bytes in position 18-19: illegal UTF-16 surrogate

after trying to read the contents of the file. I have tried forcing little-endian as well, but that's no good. The file header is as follows:

0x FE FF EE FF

Which I have read denotes UTF-16 Big Endian. I have been able to read the contents of the file into a raw string by using the following:

import binascii

with open('filename.data', 'rb') as f:
    raw = f.read()
hex_data = binascii.hexlify(raw)

This gets me the raw hex, but these files are sometimes little-endian and sometimes big-endian, so I want to normalize the data before I start parsing. I was hoping codecs would be able to help with that, but no luck.

Does anyone have an idea of what's going on here? I would provide the file(s) as reference but there is some sensitive data so unfortunately I can't. This file is used by Windows OS.

My end goal, as I mentioned above, is to be able to open/read these files and normalize them so that I can use the same parser for all of them, rather than having to write a few parsers with a bunch of error handling in case the encoding is wacky.

EDIT: As requested, the first 32 bytes of the file:

FE FF EE FF 11 22 00 00 03 00 00 00 01 00 00 00 
92 EC DA 48 1B 00 00 00 63 00 3A 00 5C 00 77 00

Upvotes: 7

Views: 21809

Answers (1)

Daniel

Reputation: 42748

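Looks like you have a 24-byte binary header before your UTF-16-encoded string starts: in your dump, the bytes from offset 24 onwards (63 00 3A 00 5C 00 77 00) are "c:\w" in UTF-16 little-endian. So you can read the file as binary and decode afterwards: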

with open(filename, "rb") as data:
    header = data.read(24)
    text = data.read().decode('utf-16-le')
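
As a quick check, the same split can be tried on the 32 bytes posted in the question. This is only a sketch for Python 3: the name HEADER_SIZE is made up for illustration, and the value 24 is inferred from this one sample, not from any format specification.

```python
# First 32 bytes of the file, copied from the question.
sample = bytes.fromhex(
    "FE FF EE FF 11 22 00 00 03 00 00 00 01 00 00 00 "
    "92 EC DA 48 1B 00 00 00 63 00 3A 00 5C 00 77 00"
)

# Assumption: the binary header is 24 bytes long, as it appears to be here.
HEADER_SIZE = 24

header = sample[:HEADER_SIZE]    # opaque binary data, left undecoded
payload = sample[HEADER_SIZE:]   # looks like UTF-16 little-endian text

text = payload.decode("utf-16-le")
print(repr(text))  # 'c:\\w'
```

The visible text is little-endian even though the file starts with FE FF, so those first bytes are probably not a UTF-16 byte order mark for the whole file.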

But there are probably other binary parts. Without knowing the exact file format, it's hard to give more help.
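
For what it's worth, the original error can be reproduced from the posted bytes alone, which suggests it comes from decoding the binary header as text. A small Python 3 sketch (the variable names are just for illustration):

```python
# First 32 bytes of the file, copied from the question.
sample = bytes.fromhex(
    "FE FF EE FF 11 22 00 00 03 00 00 00 01 00 00 00 "
    "92 EC DA 48 1B 00 00 00 63 00 3A 00 5C 00 77 00"
)

error = None
try:
    # Treat everything, header included, as UTF-16 big-endian text.
    sample.decode("utf-16-be")
except UnicodeDecodeError as exc:
    error = exc
    # Bytes 18-19 are DA 48, a high surrogate that is not followed
    # by a low surrogate (the next two bytes are 1B 00).
    print(exc.start, exc.reason)
```

The failing offset falls inside the first 24 bytes, so switching endianness alone will not help; the header has to be skipped or parsed separately.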

Upvotes: 4
