Reputation: 1334
I expected the following code
import gzip
import struct
import numpy as np

def read_ubyte(fname):
    with gzip.open(fname, 'rb') as flbl:
        # Header: two big-endian uint32 values (magic number, item count)
        magic, num = struct.unpack(">II", flbl.read(8))
        lbl = np.fromfile(flbl, dtype=np.int8)
    return magic, num, lbl

if __name__ == "__main__":
    print(read_ubyte("train-labels-idx1-ubyte.gz"))
to work exactly the same as first doing gunzip train-labels-idx1-ubyte.gz
then executing
import struct
import numpy as np

def read_ubyte(fname):
    with open(fname, 'rb') as flbl:
        # Header: two big-endian uint32 values (magic number, item count)
        magic, num = struct.unpack(">II", flbl.read(8))
        lbl = np.fromfile(flbl, dtype=np.int8)
    return magic, num, lbl

if __name__ == "__main__":
    print(read_ubyte("train-labels-idx1-ubyte"))
but it doesn't. The first version outputs:
(2049, 60000, array([ 0, 3, 116, ..., -22, 0, 0], dtype=int8))
and the second
(2049, 60000, array([5, 0, 4, ..., 5, 6, 8], dtype=int8))
Why?
note 1: the second is the right output (with no gzip module usage)
note 2: the numbers 2049 and 60000 are right
note 3: If you want to reproduce, you can download the file at http://yann.lecun.com/exdb/mnist/
Upvotes: 1
Views: 230
Reputation: 249283
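NumPy and gzip disagree about file object semantics: np.fromfile() reads through the underlying OS-level file rather than calling the object's read() method, so with a gzip file object it in effect sees compressed bytes instead of the decompressed stream. This is a known issue, which some parts of NumPy (like np.load()) accommodate, but fromfile() does not.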
To work around it (only needed in the gzip case, but works in both):
lbl = np.fromstring(flbl.read(), dtype=np.int8)
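On recent NumPy versions the binary mode of np.fromstring is deprecated in favor of np.frombuffer, so the same workaround can be sketched as below. This is only an illustration: read_labels, write_demo_file, and the tiny synthetic file are hypothetical names and data made up for the example, not part of the MNIST distribution, and picking the opener from the ".gz" suffix is an assumption.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def read_labels(fname):
    """Read an IDX-style label file, with or without gzip compression."""
    # Assumption: compressed files are recognized by their ".gz" suffix.
    opener = gzip.open if fname.endswith(".gz") else open
    with opener(fname, "rb") as f:
        # Header: two big-endian uint32 values (magic number, item count).
        magic, num = struct.unpack(">II", f.read(8))
        # f.read() goes through the file object, so gzip data is
        # decompressed before NumPy ever sees the bytes.
        labels = np.frombuffer(f.read(), dtype=np.int8)
    return magic, num, labels


def write_demo_file(path, labels):
    """Write a small synthetic gzip-compressed label file (hypothetical data)."""
    with gzip.open(path, "wb") as f:
        f.write(struct.pack(">II", 2049, len(labels)))
        f.write(bytes(labels))


# Round-trip a five-label demo file instead of the real MNIST download.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-labels-idx1-ubyte.gz")
write_demo_file(demo_path, [5, 0, 4, 1, 9])
magic, num, labels = read_labels(demo_path)
print(magic, num, labels)
```

Note that np.frombuffer returns a read-only view of the bytes object; call .copy() on the result if the array needs to be modified.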
Upvotes: 3