Javier

Reputation: 1550

Pickle file size when pickling numpy arrays or lists

I have thousands of tuples, each containing long (8640-element) lists of integers. For example:

type(l1)
tuple

len(l1)
2

l1[0][:10]
[0, 31, 23, 0, 0, 0, 0, 0, 0, 0]

l1[1][:10]
[0, 0, 11, 16, 24, 0, 0, 0, 0, 0] 

I am "pickling" the tuples and it seems that when the tuples are of lists the pickle file is lighter than when are of numpy arrays. I am not that new to python, but by no means I am an expert and I don't really know how the memory is administrated for different types of objects. I would have expected numpy arrays to be lighter, but this is what I obtain when I pickle different types of objects:

import pickle
import numpy as np

# each element of the tuple as a numpy array (default integer dtype)
l2 = [np.asarray(l1[i]) for i in range(len(l1))]
l2
[array([ 0, 31, 23, ...,  2,  0,  0]), array([ 0,  0, 11, ...,  1,  0,  0])]

# the integers are small enough to fit in two bytes each
l3 = [np.asarray(l1[i], dtype='u2') for i in range(len(l1))]
l3
[array([ 0, 31, 23, ...,  2,  0,  0], dtype=uint16),
 array([ 0,  0, 11, ...,  1,  0,  0], dtype=uint16)]

# the original tuple of lists (pickle files should be opened in binary mode)
with open('file1.pkl', 'wb') as f:
    pickle.dump(l1, f)

# tuple of numpy arrays
with open('file2.pkl', 'wb') as f:
    pickle.dump(l2, f)

# tuple of numpy arrays with unsigned 2-byte integers
with open('file3.pkl', 'wb') as f:
    pickle.dump(l3, f)

and when I check the size of the files:

$ du -h file1.pkl
72K     file1.pkl

$ du -h file2.pkl
540K    file2.pkl

$ du -h file3.pkl
136K    file3.pkl

So even when the integers are stored in two bytes each, file1 is lighter than file3. I would prefer to use arrays because unpickling (and processing) arrays is much faster than with lists. However, I am going to store lots of these tuples (in a pandas data frame), so I would also like to optimise memory as much as possible.
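
For reference, a minimal sketch of how to compare the pickled size of a list against the equivalent array under different pickle protocols (the variable names are stand-ins for my data, and the byte counts will vary; note that on Python 2, pickle.dump defaults to the ASCII-based protocol 0):

import pickle
import numpy as np

data = list(range(8640))            # stand-in for one 8640-element list
arr = np.asarray(data, dtype='u2')  # the same values as a 2-byte array

# higher protocols are binary and store the array's raw buffer compactly
for proto in (0, 1, 2):
    print("protocol %d: list=%d bytes, array=%d bytes" % (
        proto,
        len(pickle.dumps(data, protocol=proto)),
        len(pickle.dumps(arr, protocol=proto))))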

The way I need this to work is: given a list of tuples, I do:

# list of pickled byte strings from pickle.dumps
tpl_pkl = [pickle.dumps(t) for t in listoftuples]

# existing pandas data frame; inserting a new column
df['tuples'] = tpl_pkl

Overall, my question is: is there a reason why numpy arrays take up more space than lists after being pickled to a file?

If I understand the reason, maybe I can find an optimal way of storing the arrays.

Thanks in advance for your time.

Upvotes: 8

Views: 11658

Answers (2)

step21

Reputation: 1

If the data you provided is representative, this looks like premature optimization to me: that is really not a lot of data, and apparently only integers. I am currently pickling a file with millions of entries of strings and integers; at that scale you can start worrying about optimization. In your case the difference likely does not matter much, especially if this is run manually and does not feed into a web app or similar.

Upvotes: -2

holdenweb

Reputation: 37043

If you want to store numpy arrays on disk you shouldn't be using pickle at all. Investigate numpy.save() and its kin.
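
For example, a minimal sketch (the file names and array names here are illustrative):

import numpy as np

a = np.arange(8640, dtype='u2')
b = np.arange(8640, dtype='u2')

# store several arrays in one .npz archive, optionally compressed
np.savez('arrays.npz', first=a, second=b)
np.savez_compressed('arrays_compressed.npz', first=a, second=b)

# arrays come back under the keyword names used above
loaded = np.load('arrays.npz')
print(loaded['first'][:10])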

If you are using pandas, it too has its own methods. You might want to consult this article or the answer to this question for better techniques.
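
For instance, a minimal sketch of the pandas route (the frame and file name are illustrative; to_pickle() and read_pickle() are pandas' own serializers):

import numpy as np
import pandas as pd

df = pd.DataFrame({'values': np.arange(5, dtype='u2')})
df.to_pickle('frame.pkl')           # pandas handles the serialization
df2 = pd.read_pickle('frame.pkl')
print(df2.equals(df))               # True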

Upvotes: 3
