ytrewq

Reputation: 3930

Python pickle file strangely large

I made a pickle file storing the grayscale value of each pixel in 100,000 80x80 images.

(Plus an array of 100,000 single-digit integers.)

My approximation for the total size of the pickle is

4 bytes x 80 x 80 x 100,000 = 2.56 GB 

plus the array of integers, which shouldn't be that large.
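A quick sanity check of the element size (numpy.empty defaults to float64, i.e. 8 bytes per element, unless an explicit dtype is passed):

```python
import numpy

# numpy.empty defaults to float64 unless an explicit dtype is given
a = numpy.empty([2, 6400])
print(a.dtype)     # float64
print(a.itemsize)  # 8 bytes per element, not 4
# so the raw data alone is 8 x 6400 x 100,000 bytes = 5.12 GB
```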

The generated pickle file, however, is over 16 GB, so it takes hours just to unpickle and load, and it eventually freezes after consuming all available memory.

Is there something wrong with my calculation or is it the way I pickled it?

I pickled the file in the following way.

from PIL import Image
import pickle
import os
import numpy
import time

trainpixels = numpy.empty([80000,6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000,6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408,6400])
testlabels = numpy.empty(10408)

i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root, f))
            Imv = im.load()
            x, y = im.size
            pixelv = numpy.empty(6400)
            ind = 0
            for ii in range(x):
                for j in range(y):
                    # PIL pixel access is indexed (x, y)
                    pixelv[ind] = float(Imv[ii, j]) / 255.0
                    ind += 1
            if i < 40000:
                trainpixels[tr] = pixelv
                tr += 1
            elif i < 45000:
                validpixels[va] = pixelv
                va += 1
            else:
                testpixels[te] = pixelv
                te += 1
            print str(i) + '\t' + str(f)
            i += 1
        except IOError:
            continue
trainimage=(trainpixels,trainlabels)
validimage=(validpixels,validlabels)
testimage=(testpixels,testlabels)

output = open('data.pkl', 'wb')

pickle.dump(trainimage, output)
pickle.dump(validimage, output)
pickle.dump(testimage, output)
output.close()
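For reference, the three dumps above come back with one pickle.load per dump, in the same order (a minimal sketch with small stand-in tuples instead of the real arrays):

```python
import pickle

# stand-in tuples instead of the real (pixels, labels) arrays
with open('data.pkl', 'wb') as output:
    pickle.dump(('train', [0.1, 0.2]), output)
    pickle.dump(('valid', [0.3]), output)
    pickle.dump(('test', [0.4]), output)

# objects are read back in the order they were dumped
with open('data.pkl', 'rb') as pkl:
    trainimage = pickle.load(pkl)
    validimage = pickle.load(pkl)
    testimage = pickle.load(pkl)

print(trainimage[0])  # train
```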

Please let me know if you see something wrong with either my calculation or my code!

Upvotes: 3

Views: 6712

Answers (1)

user559633

Reputation:

Python Pickles are not a thrifty mechanism for storing data as you're storing objects instead of "just the data."

The following test case produces a 24 KB file on my system, and that is for a small, sparsely populated numpy array stored in a pickle:

import os
import sys
import numpy
import pickle

testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0

test_labels_size = sys.getsizeof(testlabels)  # 80

output = open('/tmp/pickle', 'wb')
pickle.dump(testlabels, output)  # pickle.dump returns None, so there is nothing to assign
output.close()

print os.path.getsize('/tmp/pickle')

Further, I'm not sure why you believe 4 bytes to be the size of a number in Python -- non-numpy ints are 24 bytes (sys.getsizeof(1)), and numpy arrays carry a minimum of 80 bytes of overhead (sys.getsizeof(numpy.array([0], float))).

As you stated in response to my comment, you have reasons for staying with pickle, so I won't try to convince you further not to store objects; just be aware of the overhead of doing so.

As an option, reduce the size of your training data, or pickle fewer objects.
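One more factor worth checking (a sketch, not part of the measurements above): pickle.dump defaults to the ASCII protocol 0 in Python 2, which stores an array's buffer as an escaped text string, while the binary protocol stays close to the raw size:

```python
import pickle
import numpy

arr = numpy.zeros(1000)  # 1000 float64 values = 8000 bytes of raw data

text_size = len(pickle.dumps(arr, 0))                         # protocol 0, the Python 2 default
binary_size = len(pickle.dumps(arr, pickle.HIGHEST_PROTOCOL))  # binary protocol

print(text_size)    # several times larger than the raw 8000 bytes
print(binary_size)  # close to the raw 8000 bytes
```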

Upvotes: 2
