Muppet

Reputation: 6029

Why does a numpy array with dtype=object result in a much smaller file size than dtype=int?

Here is an example:

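import numpy as np
randoms = np.random.randint(0, 20, 10000000)

# np.int and np.object were aliases for the builtins and have been removed
# from recent NumPy releases; the builtin types behave the same way here.
a = randoms.astype(int)
b = randoms.astype(object)

np.save('d:/dtype=int.npy', a)     # 39 MB
np.save('d:/dtype=object.npy', b)  # 19 MB!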

You can see that the file with dtype=object is about half the size. How come? I was under the impression that properly defined numpy dtypes are strictly better than object dtypes.

Upvotes: 5

Views: 2440

Answers (2)

Robert Kern

Reputation: 13430

EDIT: This analysis is wrong. See user2357112's answer for the correct one.

dtype=object arrays are saved as a Python pickle inside the NPY file. A Python pickle preserves identity for objects inside its object graph; i.e. if b[i] is b[j], then the pickle serializes the object referred to by b[i] and b[j] only the first time and emits a reference to it at the next occurrence. That reference is often smaller than the serialized object itself, even when the objects themselves are pretty small when serialized.

Python optimizes small integers such that it always reuses the same object for integers from -5 to 256, which includes everything in range(0, 20), the only values in your array. numpy may also decide to reuse instances when it converts via .astype(object).

If you created an array where most or all of the values are unique, such as floating-point values drawn with uniform(0.0, 1.0, 10000000), then you would get the relative sizes that you expect.

Upvotes: 2

user2357112

Reputation: 280335

With a non-object dtype, most of the npy file format consists of a dump of the raw bytes of the array's data. That'd be either 4 or 8 bytes per element here, depending on whether your NumPy defaults to 4- or 8-byte integers. From the file size, it looks like 4 bytes per element.
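A small sketch of that claim, assuming a 4-byte integer dtype (np.int32 is chosen here for illustration; the question's array uses whatever the platform default is) and saving into an in-memory buffer instead of a file. The variable names are made up for this example:

```python
import io

import numpy as np

# 1000 small integers stored as 4-byte ints (an assumption for this sketch).
arr = np.random.randint(0, 20, 1000).astype(np.int32)

# Save into an in-memory buffer so no file is written to disk.
buf = io.BytesIO()
np.save(buf, arr)

raw_bytes = arr.nbytes                  # itemsize * number of elements
file_bytes = len(buf.getvalue())        # total size of the .npy payload
header_bytes = file_bytes - raw_bytes   # what is left is the npy header

print(raw_bytes, file_bytes, header_bytes)
```

Everything beyond the small header is the raw array data, so the file size scales with itemsize times the element count.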

With an object dtype, most of the npy file format consists of an ordinary pickle of the array. For small integers, such as those in your array, the pickle uses the K pickle opcode, long name BININT1, "documented" in the pickletools module:

I(name='BININT1',
  code='K',
  arg=uint1,
  stack_before=[],
  stack_after=[pyint],
  proto=1,
  doc="""Push a one-byte unsigned integer.

  This is a space optimization for pickling very small non-negative ints,
  in range(256).
  """),

This requires two bytes per integer, one for the K opcode and one byte of unsigned integer data.
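To see the two-bytes-per-integer figure in practice, here is a rough sketch that pickles a single small int and then measures an object array saved to an in-memory buffer. The pickle protocol that np.save uses internally varies between NumPy versions, so the per-element number is only expected to be close to 2, not exactly 2:

```python
import io
import pickle

import numpy as np

# One small int under protocol 2: a protocol header, the 'K' opcode,
# one byte of data, and the STOP opcode '.'.
single = pickle.dumps(7, protocol=2)

# An object array of small Python ints, saved into an in-memory buffer.
values = np.random.randint(0, 20, 100_000)
obj = values.astype(object)
buf = io.BytesIO()
np.save(buf, obj)

# Header and pickle framing add a little overhead on top of 2 bytes each.
bytes_per_element = len(buf.getvalue()) / obj.size

print(single, round(bytes_per_element, 3))
```

The same reasoning only holds for values in range(256); larger or negative integers use other opcodes that take more bytes.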

Note that you could have cut down the file size even further by storing your array with dtype numpy.int8 or numpy.uint8, for roughly 1 byte per integer.
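A minimal comparison along those lines, again using in-memory buffers and np.int32 as an assumed stand-in for the question's integer dtype. The helper saved_size is hypothetical and exists only for this sketch:

```python
import io

import numpy as np


def saved_size(arr):
    """Return the number of bytes np.save writes for arr (hypothetical helper)."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return len(buf.getvalue())


arr32 = np.random.randint(0, 20, 100_000).astype(np.int32)

# Safe here because every value lies in range(0, 20), well inside uint8's 0..255.
arr8 = arr32.astype(np.uint8)

size32 = saved_size(arr32)
size8 = saved_size(arr8)

print(size32, size8)
```

The narrower dtype keeps the values intact as long as they fit in one byte, and it stays a proper numeric array that NumPy can operate on directly.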

Upvotes: 8
