user2543682

Reputation: 353

Pickle dump huge file without memory error

I have a program where I basically adjust the probability of certain things happening based on what is already known. My data is already saved as a pickled dictionary object in Dictionary.txt.

The problem is that every time I run the program, it loads Dictionary.txt, turns it into a dictionary object, makes its edits, and overwrites Dictionary.txt. This is pretty memory intensive, as Dictionary.txt is 123 MB. Loading the file seems to work fine, but when I dump it I get a MemoryError.

Thank you for your time.

Upvotes: 35

Views: 65168

Answers (11)

pb08

Reputation: 1

I found the problem to be the machine's memory.

I was dumping a very large Python list. Each list item had 54 elements of mixed type, and there were up to ~1.7M items. At about 400K items, it started to produce a memory error.

I had the luxury of working on a cluster where I could specify the system memory in batch mode, and I found that I had to increase the requested memory. In my case, after going up to 50 GB, the memory error disappeared.

Upvotes: 0

W. Dan

Reputation: 1027

I tried the following solutions, but none of them resolved my problem:

  1. Using hickle to replace pickle
  2. Using joblib to replace pickle
  3. Using sklearn.externals joblib to replace pickle
  4. Change the pickle mode

Here is a different fix for this issue:

In the end, I found that the root cause was that the working directory path was too long, so I changed the directory to one with a very short path.

Enjoy it.

Upvotes: 0

Ch HaXam

Reputation: 499

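I was having the same issue. I used joblib and it got the job done. Posting this in case someone wants to know other possibilities.

# Save the model to disk (`model` is assumed to be an already-trained estimator).
# Note: `sklearn.externals.joblib` was removed in scikit-learn 0.23; import joblib directly.
import joblib

filename = 'finalized_model.sav'
joblib.dump(model, filename)

# Some time later... load the model from disk.
# `X_test` and `Y_test` are assumed to be existing held-out data.
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)

print(result)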

Upvotes: 21

lyron

Reputation: 246

This may seem trivial, but make sure you are using 64-bit Python if you are not already.
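A quick way to check which build is running, as a minimal sketch (the helper name `python_bits` is made up for illustration):

```python
import struct
import sys


def python_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # sys.maxsize > 2**32 is another common check for a 64-bit build.
    print(python_bits(), "bit Python; sys.maxsize > 2**32:", sys.maxsize > 2**32)
```

A 32-bit interpreter can only address a few GB per process, no matter how much RAM the machine has, so this is worth ruling out first.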

Upvotes: 1

gidim

Reputation: 2323

None of the above answers worked for me. I ended up using Hickle, which is a drop-in replacement for pickle based on HDF5. Instead of saving to a pickle, it saves the data to an HDF5 file. The API is identical for most use cases, and it has some really cool features such as compression.

pip install hickle

Example:

import numpy as np
import hickle as hkl

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Load data
array_hkl = hkl.load('test.hkl')

Upvotes: 5

Mike McKerns

Reputation: 35247

I am the author of a package called klepto (and also the author of dill). klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.

>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
>>> 

klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and works no differently from a different process.

>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache  
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f       
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)

You can also pick from various levels of file compression, and whether you want the files to be memory-mapped. There are a lot of different options, both for file backends and database backends. The interface is identical, however.

With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or any of the other dictionary methods.

See a longer tutorial here: https://github.com/mmckerns/tlkklp

Get klepto here: https://github.com/uqfoundation

Upvotes: 18

Andrew Scott Evans

Reputation: 1033

I recently had this problem. After trying cPickle with both the ASCII protocol and binary protocol 2, I found that my SVM from scikit-learn, trained on 20+ GB of data, still would not pickle because of a memory error. However, the dill package seemed to resolve the issue. Dill will not bring many improvements for a dictionary, but it may help with streaming. It is meant to stream pickled bytes across a network.

import dill

# `obj` is the object to serialize; `path` is the target file path.
with open(path, 'wb') as fp:
    dill.dump(obj, fp)

# Load it back through a separate handle opened for reading.
with open(path, 'rb') as fp:
    obj = dill.load(fp)

If efficiency is an issue, try loading from and saving to a database. In this instance, your storage solution may be the issue. At 123 MB, Pandas should be fine. However, if the machine has limited memory, SQL offers fast, optimized bag operations over data, usually with multithreaded support. My poly-kernel SVM saved.

Upvotes: 2

den.run.ai

Reputation: 5943

I had a memory error and resolved it by using protocol=2:

cPickle.dump(obj, file, protocol=2)
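cPickle exists only on Python 2. On Python 3, the plain `pickle` module uses its C implementation automatically when available, and a binary protocol can be requested the same way. The following is a minimal sketch; the helper names `dump_binary` and `load_binary` are made up for illustration, and whether a binary protocol avoids a given MemoryError depends on the data and the machine.

```python
import os
import pickle
import tempfile


def dump_binary(obj, path):
    """Pickle `obj` to `path` using the newest binary protocol available."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_binary(path):
    """Load a single pickled object back from `path`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Round-trip a small dictionary through a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "Dictionary.pkl")
        dump_binary({"rain": 0.25, "sun": 0.75}, target)
        print(load_binary(target))
```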

Upvotes: 4

richie

Reputation: 18648

How about this?

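import cPickle as pickle

# Use a context manager so the file is flushed and closed after dumping.
with open("temp.p", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True  # disables the memo; not safe for self-referential objects
    p.dump(d)  # d could be your dictionary or any other picklable object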

Upvotes: 2

Chris Wheadon

Reputation: 840

Have you tried using streaming-pickle? https://code.google.com/p/streaming-pickle/

I have just solved a similar memory error by switching to streaming pickle.
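The general idea of streaming can be sketched with the standard library alone. This is not the streaming-pickle API, just an illustration of the same approach: write each dictionary entry as its own pickle record, then read the records back one at a time. The helper names `dump_items` and `iter_items` are made up for this example.

```python
import os
import pickle
import tempfile


def dump_items(mapping, path):
    """Write each (key, value) pair of `mapping` as a separate pickle record."""
    with open(path, "wb") as fh:
        for item in mapping.items():
            pickle.dump(item, fh, protocol=pickle.HIGHEST_PROTOCOL)


def iter_items(path):
    """Yield (key, value) pairs one at a time until the file is exhausted."""
    with open(path, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:
                # pickle.load raises EOFError once no records remain.
                return


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "items.pkl")
        dump_items({"a": 1, "b": 2}, target)
        print(dict(iter_items(target)))
```

Because only one entry is serialized at a time, the pickler never has to build a single byte stream for the whole dictionary. The dictionary itself still has to fit in memory.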

Upvotes: 2

Imran

Reputation: 91039

If your keys and values are strings, you can use one of the embedded persistent key-value storage engines available in the Python standard library. Example from the anydbm module docs:

import anydbm

# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')

# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'

# Loop through contents.  Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
    print k, '\t', v

# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4

# Close when done.
db.close()
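The example above is Python 2 (`anydbm`, the `print` statement, `iteritems`). On Python 3 the module is named `dbm`. A rough equivalent might look like the sketch below; the helper name `demo_dbm` and the temporary path are made up for illustration.

```python
import dbm
import os
import tempfile


def demo_dbm(path):
    """Store two string records at `path` and return them as a dict of str."""
    # 'c' opens the database for reading and writing, creating it if needed.
    with dbm.open(path, 'c') as db:
        db['www.python.org'] = 'Python Website'
        db['www.cnn.com'] = 'Cable News Network'

    # Reopen read-only; keys and values come back as bytes.
    with dbm.open(path, 'r') as db:
        return {k.decode('utf-8'): db[k].decode('utf-8') for k in db.keys()}


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in demo_dbm(os.path.join(tmp, 'cache')).items():
            print(key, '\t', value)
```

Only individual records are read or written, so the whole data set never has to be held in memory or re-serialized at once.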

Upvotes: 2
