Why does a numpy array with dtype=object result in a much smaller file size than dtype=int?

Question

Here is an example:

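import numpy as np
randoms = np.random.randint(0, 20, 10000000)

# np.int and np.object were aliases for the builtins int and object;
# both aliases were removed in NumPy 1.24, so the builtins are used here.
a = randoms.astype(int)
b = randoms.astype(object)

np.save('d:/dtype=int.npy', a)     # 39 MB
np.save('d:/dtype=object.npy', b)  # 19 MB!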

You can see that the file with dtype=object is about half the size. How come? I was under the impression that properly defined numpy dtypes are strictly better than object dtypes.

user2357112 · Accepted Answer

With a non-object dtype, most of the npy file format consists of a dump of the raw bytes of the array's data. That'd be either 4 or 8 bytes per element here, depending on whether your NumPy defaults to 4- or 8-byte integers. From the file size (about 39 MB for 10 million elements), it looks like 4 bytes per element.
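To check the "raw bytes plus a small header" claim without touching the disk, here is a minimal sketch that saves into an in-memory buffer. The array is smaller than the one in the question, and the explicit int32/int64 dtypes are an assumption standing in for whatever your platform's default integer is.

```python
import io
import numpy as np

# Smaller than the question's array so the sketch runs quickly.
values = np.random.randint(0, 20, 100_000)

overhead = {}
for dtype in (np.int32, np.int64):
    arr = values.astype(dtype)
    buf = io.BytesIO()
    np.save(buf, arr)                      # same format as saving to a .npy file
    file_size = buf.getbuffer().nbytes
    # Whatever is not raw element data is the npy header.
    overhead[arr.dtype.name] = file_size - arr.nbytes
    print(arr.dtype.name, "itemsize:", arr.itemsize,
          "data bytes:", arr.nbytes, "file bytes:", file_size)
```

The file size should come out as itemsize times the element count plus a small fixed header, which is why the question's file is roughly 4 bytes per element.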

With an object dtype, most of the npy file format consists of an ordinary pickle of the array. For small integers, such as those in your array, the pickle uses the K pickle opcode, long name BININT1, "documented" in the pickletools module:

I(name='BININT1',
  code='K',
  arg=uint1,
  stack_before=[],
  stack_after=[pyint],
  proto=1,
  doc="""Push a one-byte unsigned integer.

  This is a space optimization for pickling very small non-negative ints,
  in range(256).
  """),

This requires two bytes per integer: one for the K opcode and one byte of unsigned integer data. For 10 million elements that is about 20 million bytes, which matches the roughly 19 MB file.
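The opcode can be seen directly by disassembling a pickle with pickletools. This is only a sketch: it pickles plain Python ints and a list rather than an array, and protocol 2 is chosen here for illustration, not because it is necessarily the protocol your NumPy version uses when saving.

```python
import pickle
import pickletools

# A single small int: PROTO header, then BININT1 with a one-byte argument, then STOP.
payload = pickle.dumps(7, protocol=2)
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(payload)]
print(ops)

# A list of small ints: one BININT1 (2 bytes) per element.
list_payload = pickle.dumps([1, 2, 3], protocol=2)
binint1_count = sum(1 for op, arg, pos in pickletools.genops(list_payload)
                    if op.name == "BININT1")
print(binint1_count)

# Outside range(256) the one-byte form no longer fits, so a wider opcode is used.
wide_ops = [op.name for op, arg, pos in
            pickletools.genops(pickle.dumps(256, protocol=2))]
print(wide_ops)
```

Values of 256 or more need a wider opcode, so the two-bytes-per-element figure only holds for arrays of very small non-negative integers like the ones in the question.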

Note that you could have cut down the file size even further by storing your array with dtype numpy.int8 or numpy.uint8, for roughly 1 byte per integer.
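As a rough comparison of the three options, the sketch below saves the same values with each dtype into an in-memory buffer and records the resulting size. The helper name saved_size is hypothetical, and the array is smaller than the one in the question.

```python
import io
import numpy as np

def saved_size(arr):
    """Return the number of bytes np.save would write for arr."""
    buf = io.BytesIO()
    np.save(buf, arr)  # allow_pickle defaults to True, which object arrays need
    return buf.getbuffer().nbytes

values = np.random.randint(0, 20, 100_000)  # every value fits in one unsigned byte

sizes = {
    name: saved_size(values.astype(dtype))
    for name, dtype in [("uint8", np.uint8), ("object", object), ("int32", np.int32)]
}
for name, size in sizes.items():
    print(name, size, "bytes,", round(size / values.size, 2), "per element")
```

The uint8 array should land near 1 byte per element, the object array near 2, and int32 near 4. The narrow dtype only works because the values are known to fit in range(256); larger values would not survive the cast to uint8 unchanged.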
