Reputation: 475
I'm trying to read Windows-generated UTF-16 files with Python. From what I understand, the byte-order mark (BOM) is FEFF, and that's what this file starts with. However, when I read the file into Python, the bytes seem to get swapped.
(venv) [user]:~/consolidate$ head -c 16 temp.txt | od -x
0000000 feff 0022 0076 0065 0072 0073 0069 006f
0000020
(venv) [user]:~/consolidate$ python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('temp.txt', 'rb') as f:
... str = f.readline()
... print(str)
...
b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n...
Using head, the first characters are feff 0022. Using Python, they appear to be fffe2200. What's going on here?
EDIT: my question is specifically about byte order. A few points:
Example second line reading:
>>> with open('temp.txt', 'rb') as f:
... str1 = f.readline()
... str2 = f.readline()
...
>>> str2
b'\x00"\x00"\x00`\x00"\x00P\x
Upvotes: 0
Views: 1802
Reputation: 159081
There are three separate, similar-looking things going on here. The file is a sequence of bytes, and the Python byte string b'\xff\xfe"\x00v\x00e\x00...'
shows the bytes in the same order they appear in the file:
FF FE 22 00 76 00 65 00
When you ran od -x, it grouped pairs of bytes into 16-bit numbers. On x86 systems the standard byte ordering for 2-byte 16-bit numbers is little-endian: the least-significant byte (the "ones byte") comes first and the most-significant byte (the "256s byte") comes second (in Python, n = b[0] + 256*b[1]). So you get this little-endian decoding:
FEFF 0022 0076 0065
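For illustration, od's grouping can be reproduced in Python with the standard struct module:

```python
import struct

data = b'\xff\xfe"\x00v\x00e\x00'
# '<4H' reads four little-endian unsigned 16-bit integers,
# the same grouping od -x performs on an x86 machine.
words = struct.unpack('<4H', data)
print(' '.join(format(w, '04x') for w in words))  # feff 0022 0076 0065
```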
Meanwhile, you want to decode this into Unicode characters. So long as no character is above U+FFFF, the UTF-16 little-endian (UTF-16LE) encoding maps each of those same 16-bit numbers to one Unicode character:
U+FEFF U+0022 U+0076 U+0065
<BOM> " v e
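A quick check of that mapping, decoding the same bytes as UTF-16LE:

```python
data = b'\xff\xfe"\x00v\x00e\x00'
text = data.decode('utf-16-le')
# Each 16-bit number becomes one character, BOM included.
print([format(ord(c), '04x') for c in text])  # ['feff', '0022', '0076', '0065']
```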
What happens at the end of the line? Let's consider the string u'...",\n ...'
and do this exercise in reverse order.
" , \n <SPC>
U+0022 U+002C U+000A U+0020
22 00 2C 00 0A 00 20 00
b'"\x00,\x00\n\x00 \x00'
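That reverse exercise is easy to verify directly:

```python
s = '",\n '
encoded = s.encode('utf-16-le')
# Quote, comma, newline, space -- each followed by a null byte.
print(encoded)  # b'"\x00,\x00\n\x00 \x00'
```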
Meanwhile: what happens if you don't think about the character encoding at all and just "split this on newlines"? You'd see [b'"\x00,\x00', b'\x00 \x00']. The first part looks like little-endian byte order (quote, null, comma, null), but the last part looks big-endian (null, space). The second part isn't actually a valid UTF-16 string at all: it contains an odd number of bytes, because its first byte is really the second half of the encoded newline. That's what's happening when you call readline on the binary file.
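The byte-level split can be demonstrated directly:

```python
raw = b'"\x00,\x00\n\x00 \x00'
parts = raw.split(b'\n')
print(parts)                    # [b'"\x00,\x00', b'\x00 \x00']
# The 3-byte second part cannot be valid UTF-16.
print([len(p) for p in parts])  # [4, 3]
```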
You have a couple of options to deal with this. One, mentioned in another answer, is to open(filename, 'r', encoding='utf-16')
(without a "b" in the file mode). Then Python will do the correct UTF-16 decoding (taking the byte-order mark into account) and you will get a character string. Calls like f.readline()
will also do what you expect here.
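A minimal sketch of that approach (temp_demo.txt is a stand-in filename; the bytes match the start of the file in the question):

```python
# Write UTF-16LE bytes with a BOM, like the Windows-generated file.
payload = b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00\n\x00'
with open('temp_demo.txt', 'wb') as f:
    f.write(payload)

# Text mode with encoding='utf-16' honors the BOM and yields str objects.
with open('temp_demo.txt', 'r', encoding='utf-16') as f:
    line = f.readline()
print(repr(line))  # '"version\n'
```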
You also said your goal is just to split the file. If you know with absolute certainty that the file is UTF-16LE encoded (the first two bytes are definitely FF FE) then you could process it as a byte string (with mode 'rb'
as in the code in the question) and split it on the UTF-16-encoded byte sequences you want:
with open('temp.txt', 'rb') as f:
    everything = f.read()
lines = everything.split(b'\x0A\x00')    # UTF-16LE-encoded newline
for line in lines:
    parts = line.split(b'\x3A\x26')      # UTF-16LE encoding of the delimiter character
This is easier to do if you can read the entire file in one chunk; at 10 GB that could be tricky in Python.
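If loading everything at once is a problem, one alternative sketch is to iterate the file in text mode, which decodes incrementally (big_demo.txt and the U+263A delimiter are stand-ins for this example; b'\x3A\x26' is the UTF-16LE encoding of U+263A):

```python
# Build a small stand-in file: two lines, fields separated by U+263A.
payload = '\ufeffa\u263ab\nc\u263ad\n'.encode('utf-16-le')
with open('big_demo.txt', 'wb') as f:
    f.write(payload)

# Iterating a text-mode file reads and decodes a buffer at a time,
# so memory use stays bounded even for a very large file.
rows = []
with open('big_demo.txt', 'r', encoding='utf-16') as f:
    for line in f:
        rows.append(line.rstrip('\n').split('\u263a'))
print(rows)  # [['a', 'b'], ['c', 'd']]
```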
Upvotes: 1
Reputation: 1318
You can explicitly decode as little-endian with utf-16-le
and the BOM shows up in the result, as expected:
>>> b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00'.decode('utf-16-le')
'\ufeff"version'
If you decode with utf-16
instead, the BOM is detected and removed for you:
>>> b'\xff\xfe"\x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00'.decode('utf-16')
'"version'
Upvotes: 0
Reputation: 837
Add the encoding='utf-16' parameter to open:
open('temp.txt', 'r', encoding='utf-16')
Upvotes: 0