Reputation: 567
I am playing around with the MNIST dataset and I have encountered the following, which I don't quite understand. According to the documentation the data is formatted as follows:
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label
The label values are 0 to 9.
Thus, I would expect bytes 4-7, corresponding to the number of items (60,000), to be:
struct.pack('i', 60000)
>> '`\xea\x00\x00'
However, when I read the file byte-by-byte, it looks like they are in reverse order:
import gzip
import struct
with gzip.open(path_to_file, 'rb') as f:
    print struct.unpack('cccc', f.read(4))
    for i in range(4):
        print struct.unpack('c', f.read(1))
>> ('\x00', '\x00', '\x08', '\x01')
>> ('\x00', '\x00', '\xea', '`')
Clearly, I can reverse them to get the expected order, but I am confused as to why the bytes seem to be reversed.
Upvotes: 0
Views: 1480
Reputation: 77837
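This is an artifact of byte ordering within a word. The data is formatted as a 32-bit integer, so you're supposed to read it that way rather than byte by byte. Note that the format specified for the first field is "MSB first", i.e. big-endian: the lowest (earliest) address holds the most significant byte. struct.pack('i', 60000) uses your machine's native byte order, which on most PCs is little-endian (the lowest address holds the least significant byte), so its output is reversed relative to the file. Giving struct an explicit byte order, such as '>i' for big-endian, matches the file's layout.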
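As a minimal sketch of the point above (written for Python 3, and using a hand-built 8-byte header instead of the real file so it runs on its own), the header can be decoded with an explicit big-endian format. The helper name parse_label_header is hypothetical, not part of any library:

```python
import struct

def parse_label_header(header_bytes):
    """Decode the first 8 bytes of an MNIST label file.

    Assumes two 32-bit integers stored MSB first (big-endian),
    as described in the format table in the question.
    """
    magic, count = struct.unpack('>ii', header_bytes[:8])
    return magic, count

# The eight header bytes exactly as the question shows them on disk.
raw = b'\x00\x00\x08\x01\x00\x00\xea\x60'
magic, count = parse_label_header(raw)
print(magic, count)

# The same value packed with an explicit byte order each way.
big = struct.pack('>i', 60000)     # most significant byte first, as in the file
little = struct.pack('<i', 60000)  # least significant byte first
print(big, little)
```

With the real file, the same call would be applied to f.read(8) from the gzip handle; the '>' prefix makes the result independent of the machine's native byte order.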
Upvotes: 1