terry87

Reputation: 475

Python 3.x reading UTF-16 files seems to reverse byte order

I'm trying to read Windows-generated UTF-16 files with Python. From what I understand, the UTF-16 BOM is FEFF, and that's what this file starts with. However, when I read the file into Python, the bytes seem to get swapped.

(venv) [user]:~/consolidate$ head -c 16 temp.txt | od -x
0000000 feff 0022 0076 0065 0072 0073 0069 006f
0000020
(venv) [user]:~/consolidate$ python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('temp.txt', 'rb') as f:
...     str = f.readline()
...     print(str)
...
b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n...

Using head, the first characters are feff 0022. Using Python, it appears to be fffe2200. What's going on here?

EDIT: my question is specifically about byte order. A few points:

Example second line reading:

>>> with open('temp.txt', 'rb') as f:
...     str1 = f.readline()
...     str2 = f.readline()
...
>>> str2
b'\x00"\x00"\x00`\x00"\x00P\x

Upvotes: 0

Views: 1802

Answers (3)

David Maze

Reputation: 159081

There are three separate but similar things going on here. The file is a sequence of bytes, and the Python byte string b'\xff\xfe"\x00v\x00e\x00...' shows them in the same order they appear in the file:

FF FE 22 00 76 00 65 00

When you ran od -x, it grouped pairs of bytes into 16-bit numbers. On x86 systems, the standard byte ordering for 16-bit numbers puts the least-significant byte ("ones byte") first and the most-significant byte ("256s byte") second (in Python, n = b[0] + 256*b[1]). So you get this little-endian decoding:
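As a rough illustration of that grouping, the sketch below uses the standard struct module to read the first eight bytes shown above as 16-bit integers, once little-endian (what od -x does on x86) and once big-endian. The variable names are only for illustration.

```python
import struct

# First eight bytes of the file, in file order.
raw = b'\xff\xfe"\x00v\x00e\x00'

# '<4H' = four unsigned 16-bit integers, little-endian (od -x on x86).
little = struct.unpack('<4H', raw)
# '>4H' = the same bytes read as big-endian, for comparison.
big = struct.unpack('>4H', raw)

print(' '.join(format(n, '04x') for n in little))  # feff 0022 0076 0065
print(' '.join(format(n, '04x') for n in big))     # fffe 2200 7600 6500
```

The bytes never change; only the grouping rule does.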

FEFF  0022  0076  0065

Meanwhile, you want to decode this into Unicode characters. So long as no character is above U+FFFF, the UTF-16 little-endian (UTF-16LE) encoding translates that same decoding into Unicode characters:

U+FEFF U+0022 U+0076 U+0065
<BOM>     "      v      e

What happens at the end of the line? Let's consider the string u'...",\n ...' and do this exercise in reverse order.

   "      ,     \n   <SPC>
U+0022 U+002C U+000A U+0020
22 00  2C 00  0A 00  20 00
b'"\x00,\x00\n\x00 \x00'

Meanwhile, what happens if you ignore the character encoding and just "split this on newlines"? The byte 0A gets treated as the line break, so you'd see the pieces [b'"\x00,\x00', b'\n', b'\x00 \x00']. The first piece looks like little-endian byte order (quote null comma null), but the last piece looks big-endian (null space null). That last piece isn't actually a valid UTF-16 string: it contains an odd number of bytes, because its first byte is really the second half of the newline. That's what happens when you call readline on a file opened in binary mode.
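This can be reproduced without the original file. The sketch below uses io.BytesIO as a stand-in for the file opened in 'rb' mode and feeds it the same four-character example string:

```python
import io

# The example string from above, encoded the way the file stores it.
data = '",\n '.encode('utf-16-le')    # b'"\x00,\x00\n\x00 \x00'

f = io.BytesIO(data)                  # stands in for open('temp.txt', 'rb')
first = f.readline()                  # stops right after the 0x0A byte
second = f.readline()                 # starts with the newline's trailing 0x00

print(first)    # b'"\x00,\x00\n'
print(second)   # b'\x00 \x00'

# The second chunk has an odd number of bytes, so it cannot be UTF-16.
try:
    second.decode('utf-16-le')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False
```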

You have a couple of options to deal with this. One, mentioned in another answer, is to open(filename, 'r', encoding='utf-16') (without a "b" in the file mode). Python will then do the correct UTF-16 decoding (taking the byte-order mark into account) and give you character strings. Calls like f.readline() will also do what you expect here.
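A minimal sketch of the text-mode approach, assuming a small throwaway file written with an FF FE BOM to stand in for the Windows-generated temp.txt (the sample contents are made up):

```python
import codecs
import os
import tempfile

sample = '"version",\n "1.0"\n'   # made-up contents, not the real file

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'temp.txt')
    # Write BOM + UTF-16LE bytes, like the file in the question.
    with open(path, 'wb') as out:
        out.write(codecs.BOM_UTF16_LE + sample.encode('utf-16-le'))

    # Text mode: Python consumes the BOM and splits on real newlines.
    with open(path, 'r', encoding='utf-16') as f:
        lines = f.readlines()

print(lines)  # ['"version",\n', ' "1.0"\n']
```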

You also said your goal is just to split the file. If you know with absolute certainty that the file is UTF-16LE encoded (the first two bytes are definitely FF FE), then you could process it as a byte string (with mode 'rb', as in the code in the question) and split it on the UTF-16-encoded byte sequences you want:

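with open('temp.txt', 'rb') as f:
    everything = f.read()
# b'\x0A\x00' is "\n" encoded as UTF-16LE
lines = everything.split(b'\x0A\x00')
for line in lines:
    # b'\x3A\x26' is U+263A encoded as UTF-16LE
    parts = line.split(b'\x3A\x26')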

This is easier to do if you can read the entire file in one chunk; at 10 GB that could be tricky in Python.
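For a file too large to read at once, one possible alternative is to iterate over it in text mode, which decodes and yields one line at a time. The sketch below is an assumption-laden illustration: iter_rows is a hypothetical helper, the U+263A delimiter is simply what the bytes 3A 26 decode to under UTF-16LE, and the demo file contents are invented.

```python
import codecs
import os
import tempfile


def iter_rows(path, delimiter='\u263a'):
    """Yield each line of a UTF-16 file as a list of fields (hypothetical helper)."""
    with open(path, 'r', encoding='utf-16') as f:
        for line in f:                       # reads lazily, line by line
            yield line.rstrip('\n').split(delimiter)


# Demo on a tiny invented file; a real 10 GB file would be read the same way.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'big.txt')
    with open(path, 'wb') as out:
        out.write(codecs.BOM_UTF16_LE + 'a\u263ab\nc\u263ad\n'.encode('utf-16-le'))
    rows = list(iter_rows(path))

print(rows)  # [['a', 'b'], ['c', 'd']]
```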

Upvotes: 1

Michael Ströder

Reputation: 1318

You can explicitly decode as little-endian with utf-16-le, in which case the BOM is kept as the first character of the result:

>>> b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00'.decode('utf-16-le')
'\ufeff"version'

If you decode with utf-16 instead, the codec consumes the BOM and does not include it in the result:

>>> b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00'.decode('utf-16')
'"version'
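A small sketch of the practical difference, using invented byte strings rather than the real file: the plain utf-16 codec picks the byte order from whichever BOM it finds, while utf-16-le assumes the order and leaves the BOM in the text.

```python
import codecs

text = '"version'

# The same text stored in both byte orders, each with its own BOM.
le_bytes = codecs.BOM_UTF16_LE + text.encode('utf-16-le')   # starts FF FE
be_bytes = codecs.BOM_UTF16_BE + text.encode('utf-16-be')   # starts FE FF

# 'utf-16' reads the BOM, picks the byte order, and drops the BOM.
from_le = le_bytes.decode('utf-16')
from_be = be_bytes.decode('utf-16')

# 'utf-16-le' assumes the byte order and keeps the BOM as U+FEFF.
kept_bom = le_bytes.decode('utf-16-le')

print(from_le == from_be == text)   # True
print(kept_bom == '\ufeff' + text)  # True
```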

Upvotes: 0

jits_on_moon

Reputation: 837

Pass the encoding='utf-16' parameter to open:
open('temp.txt', 'r', encoding='utf-16')

Upvotes: 0
