LucG

Reputation: 1334

Python gzip module doesn't work as expected on ubyte file

I expected the following code

import gzip
import struct
import numpy as np

def read_ubyte(fname):
    with gzip.open(fname, 'rb') as flbl:
        magic, num = struct.unpack(">II", flbl.read(8))
        lbl = np.fromfile(flbl, dtype=np.int8)
    return magic, num, lbl

if __name__ == "__main__":
    print(read_ubyte("train-labels-idx1-ubyte.gz"))

to work exactly the same as first doing gunzip train-labels-idx1-ubyte.gz then executing

import struct
import numpy as np

def read_ubyte(fname):
    with open(fname, 'rb') as flbl:
        magic, num = struct.unpack(">II", flbl.read(8))
        lbl = np.fromfile(flbl, dtype=np.int8)
    return magic, num, lbl

if __name__ == "__main__":
    print(read_ubyte("train-labels-idx1-ubyte"))

but it doesn't. The first version (with gzip) outputs:

(2049, 60000, array([  0,   3, 116, ..., -22,   0,   0], dtype=int8))

and the second outputs:

(2049, 60000, array([5, 0, 4, ..., 5, 6, 8], dtype=int8))

Why?

note 1: the second is the right output (with no gzip module usage)

note 2: the numbers 2049 and 60000 are right

note 3: If you want to reproduce, you can download the file at http://yann.lecun.com/exdb/mnist/

Upvotes: 1

Views: 230

Answers (1)

John Zwinck

Reputation: 249283

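NumPy and gzip disagree about file object semantics. np.fromfile() does not go through the read() method of the object returned by gzip.open(); it reads from the underlying file on disk, so it gets the compressed bytes instead of the decompressed data. The array in your first output is consistent with that: 0, 3, 116 match bytes 8 to 10 of a gzip header (XFL = 0, OS = 3 for Unix, then "t", the first character of the stored file name). This is a known issue, which some parts of NumPy (like np.load()) accommodate, but fromfile() does not.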
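
The difference between the two byte streams can be seen without NumPy at all. The following is only a sketch using the standard library: the file name and the five label values are made up for illustration, and it builds a tiny file with the same layout as the label file (two big-endian 32-bit integers followed by one byte per label).

```python
import gzip
import os
import struct
import tempfile

# Synthetic stand-in for the label file: 8-byte header + 5 label bytes.
labels = bytes([5, 0, 4, 1, 9])
payload = struct.pack(">II", 2049, len(labels)) + labels

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo-labels-idx1-ubyte.gz")
with gzip.open(path, "wb") as f:
    f.write(payload)

# Reading through the gzip object yields the decompressed stream.
with gzip.open(path, "rb") as f:
    magic, num = struct.unpack(">II", f.read(8))
    decompressed_rest = f.read()

# Reading the same path as a plain file yields the compressed stream.
with open(path, "rb") as raw:
    on_disk = raw.read()

print(magic, num, list(decompressed_rest))
print(on_disk[:2] == b"\x1f\x8b")       # gzip magic number at the start
print(on_disk[8:] == decompressed_rest)  # offset 8 on disk is not the label data
```

Any reader that skips 8 bytes and then consumes the on-disk stream directly will therefore see gzip header and compressed data, not labels.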

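To work around it, read the decompressed bytes through the gzip object and let NumPy parse them from memory (only needed in the gzip case, but works in both). np.fromstring() does the same on older NumPy versions, but its binary mode is deprecated in favor of np.frombuffer():

    lbl = np.frombuffer(flbl.read(), dtype=np.int8)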
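
Put together, a complete reader might look like the sketch below. It uses np.frombuffer(), the non-deprecated counterpart of np.fromstring() for binary data. Choosing the opener from the ".gz" extension is an assumption made here for convenience, and the demo file is synthetic, not the real MNIST data.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def read_ubyte(fname):
    """Read an IDX label file, gzip-compressed or plain (sketch)."""
    # Assumption for this sketch: compressed files end in ".gz".
    opener = gzip.open if fname.endswith(".gz") else open
    with opener(fname, "rb") as flbl:
        magic, num = struct.unpack(">II", flbl.read(8))
        # flbl.read() goes through the gzip layer, so NumPy only ever
        # sees decompressed bytes.
        lbl = np.frombuffer(flbl.read(), dtype=np.int8)
    return magic, num, lbl


# Synthetic stand-in for train-labels-idx1-ubyte.gz (hypothetical values).
demo_labels = bytes([5, 0, 4, 1, 9])
demo_path = os.path.join(tempfile.mkdtemp(), "demo-labels-idx1-ubyte.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(struct.pack(">II", 2049, len(demo_labels)) + demo_labels)

magic, num, lbl = read_ubyte(demo_path)
print(magic, num, lbl.tolist())
```

Note that np.frombuffer() on a bytes object returns a read-only array; call .copy() on the result if you need to modify the labels in place.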

Upvotes: 3
