Reputation: 1550
I have thousands of tuples of long (8,640-element) lists of integers. For example:
type(l1)
tuple
len(l1)
2
l1[0][:10]
[0, 31, 23, 0, 0, 0, 0, 0, 0, 0]
l1[1][:10]
[0, 0, 11, 16, 24, 0, 0, 0, 0, 0]
I am "pickling" the tuples and it seems that when the tuples are of lists the pickle file is lighter than when are of numpy arrays. I am not that new to python, but by no means I am an expert and I don't really know how the memory is administrated for different types of objects. I would have expected numpy arrays to be lighter, but this is what I obtain when I pickle different types of objects:
# the elements of the tuple as numpy arrays (default integer dtype)
l2 = [np.asarray(x) for x in l1]
l2
[array([ 0, 31, 23, ..., 2, 0, 0]), array([ 0, 0, 11, ..., 1, 0, 0])]
# the integers are small enough to fit in two bytes each
l3 = [np.asarray(x, dtype='u2') for x in l1]
l3
[array([ 0, 31, 23, ..., 2, 0, 0], dtype=uint16),
array([ 0, 0, 11, ..., 1, 0, 0], dtype=uint16)]
# the original tuple of lists
with open('file1.pkl', 'wb') as f:
    pickle.dump(l1, f)
# the list of numpy arrays
with open('file2.pkl', 'wb') as f:
    pickle.dump(l2, f)
# the list of numpy arrays with unsigned 2-byte integers
with open('file3.pkl', 'wb') as f:
    pickle.dump(l3, f)
and when I check the size of the files:
$ du -h file1.pkl
72K file1.pkl
$ du -h file2.pkl
540K file2.pkl
$ du -h file3.pkl
136K file3.pkl
So even when the integers are stored in two bytes each, file1 is lighter than file3. I would prefer to use arrays because unpickling arrays (and processing them) is much faster than lists. However, I am going to be storing lots of these tuples (in a pandas data frame), so I would also like to optimise memory as much as possible.
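One thing I have not verified on the real data, only sketched here: the sizes above look consistent with pickle's default protocol in Python 2 (protocol 0), which serializes everything, including a numpy array's underlying buffer, as escaped ASCII text. A binary protocol stores the raw buffer almost verbatim:

# sketch: compare the default ASCII protocol with the binary one
len0 = len(pickle.dumps(l3, protocol=0))
len2 = len(pickle.dumps(l3, protocol=pickle.HIGHEST_PROTOCOL))
# with a binary protocol the uint16 arrays should weigh in at
# roughly 2 bytes per element plus a small header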
The way I need this to work is: given a list of tuples, I do:
# list of pickled byte strings from pickle.dumps
tpl_pkl = [pickle.dumps(t) for t in listoftuples]
# existing pandas data frame: insert the new column
df['tuples'] = tpl_pkl
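And to get a tuple back out of the frame later, a sketch (assuming the column holds the byte strings from pickle.dumps):

# recover the original tuple from one row
t = pickle.loads(df['tuples'].iloc[0])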
Overall, my question is: is there a reason why numpy arrays take up more space than lists after being pickled to a file?
Maybe if I understand the reason, I can find an optimal way of storing arrays.
Thanks in advance for your time.
Upvotes: 8
Views: 11658
Reputation: 1
If the data you provided is representative, this seems like premature optimization to me: that is really not a lot of data, and supposedly only integers. I am currently pickling a file with millions of entries of strings and integers; at that scale optimization starts to matter. In your case the difference likely does not matter much, especially if this is run manually and does not feed into a web app or similar.
Upvotes: -2
Reputation: 37043
If you want to store numpy arrays on disk, you shouldn't be using pickle at all. Investigate numpy.save() and its kin.
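For instance, a minimal sketch (file names here are just placeholders):

import numpy as np

a = np.arange(8640, dtype=np.uint16)
np.save('arr.npy', a)      # raw binary on disk, one array per file
b = np.load('arr.npy')

# several arrays in one compressed file
np.savez_compressed('arrs.npz', first=a, second=a)
data = np.load('arrs.npz')
first = data['first']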
If you are using pandas, then it too has its own methods. You might want to consult this article or the answer to this question for better techniques.
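As an example, a sketch of pandas' own serializers (the frame here is a placeholder; to_hdf additionally requires the PyTables package):

import pandas as pd

df = pd.DataFrame({'a': range(5)})
df.to_pickle('frame.pkl')             # pandas' own pickle wrapper
df2 = pd.read_pickle('frame.pkl')
# df.to_hdf('store.h5', key='df')     # HDF5 storage, needs PyTables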
Upvotes: 3