Reputation: 1448
I have a large 3d numpy array that I'd like to preserve. My first approach is simply to use pickle, but this seems to lead to a poorly explained error.
import pickle
import numpy as np
test_rand = np.random.random((100000,200,50))
with open('models/test.pkl', 'wb') as save_file:
    pickle.dump(test_rand, save_file, -1)
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-18-511e30b08440> in <module>()
1 with open('models/test.pkl', 'wb') as save_file:
----> 2 pickle.dump(test_rand, save_file, -1)
3
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in dump(obj, file, protocol)
1368
1369 def dump(obj, file, protocol=None):
-> 1370 Pickler(file, protocol).dump(obj)
1371
1372 def dumps(obj, protocol=None):
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in dump(self, obj)
222 if self.proto >= 2:
223 self.write(PROTO + chr(self.proto))
--> 224 self.save(obj)
225 self.write(STOP)
226
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save(self, obj)
329
330 # Save the reduce() output and finally memoize the object
--> 331 self.save_reduce(obj=obj, *rv)
332
333 def persistent_id(self, obj):
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
417
418 if state is not None:
--> 419 save(state)
420 write(BUILD)
421
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save_tuple(self, obj)
560 write(MARK)
561 for element in obj:
--> 562 save(element)
563
564 if id(obj) in memo:
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
C:\Users\g1dak02\AppData\Local\Continuum\Anaconda\lib\pickle.pyc in save_string(self, obj, pack)
484 self.write(SHORT_BINSTRING + chr(n) + obj)
485 else:
--> 486 self.write(BINSTRING + pack("<i", n) + obj)
487 else:
488 self.write(STRING + repr(obj) + '\n')
error: integer out of range for 'i' format code
So the two questions I have are as follows:
I am using Python 2.7.8 and NumPy 1.9.0.
Upvotes: 3
Views: 12856
Reputation: 35217
With regard to #1, it's a bug… and an old one at that. There's an enlightening, albeit surprisingly old, discussion about this here: http://python.6.x6.nabble.com/test-gzip-test-tarfile-failure-om-AMD64-td1830323.html
The reasons for the error are here: http://www.littleredbat.net/mk/files/grimoire.html#contents_item_2.1
The simplest and most basic type are integers, which are represented as a C long. Their size is therefore dependent on the platform you're using; on a 32-bit machine, they can range from -2147483647 to 2147483647. Python programs can determine the highest possible value for an integer by looking at sys.maxint; the lowest possible value will usually be -sys.maxint - 1.
This error is not a common one, as most people, when faced with a very large numpy array, will use np.save or np.savez, which write the array in NumPy's own binary format instead of pushing the whole data buffer through pickle as a single string (see the __reduce__ method for a numpy array for the reduced form that pickle has to serialize).
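For reference, a minimal sketch of the np.save route. The small array shape and the temporary file path are placeholders of mine, not the 8 GB array from the question:

```python
import os
import tempfile

import numpy as np

# Small stand-in for the question's (100000, 200, 50) array.
small = np.random.random((100, 20, 5))

# Placeholder path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test_rand.npy")

np.save(path, small)    # write the array in NumPy's .npy format
loaded = np.load(path)  # read it back as an ndarray

roundtrip_ok = np.array_equal(small, loaded)
print(roundtrip_ok)
```

np.savez works the same way but stores several named arrays in one .npz file.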
To show that it's just about the array being too large for pickle…
>>> import numpy as np
>>> import pickle
>>> test_rand = np.random.random((100000,200,50))
>>> x = pickle.dumps(test_rand[:20000], -1)
>>> x = pickle.dumps(test_rand[:30000], -1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 194, in dumps
dump(obj, file, protocol, byref, fmode)#, strictio)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 184, in dump
pik.dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 181, in save_numpy_array
pik.save_reduce(_create_array, (f, args, state, npdict), obj=obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 401, in save_reduce
save(args)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 486, in save_string
self.write(BINSTRING + pack("<i", n) + obj)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
>>>
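The slice sizes explain why the first call succeeds and the second fails. A quick check of the arithmetic, assuming 8 bytes per float64 element and a signed 32-bit limit on the string length, as the error message suggests (the helper name nbytes is mine):

```python
# Largest value that fits the signed 32-bit 'i' format code.
INT32_MAX = 2**31 - 1


def nbytes(rows, cols=200, depth=50, itemsize=8):
    """Raw data size in bytes of a float64 array of shape (rows, cols, depth)."""
    return rows * cols * depth * itemsize


# Sizes for the two slices tried above and for the full array.
sizes = {rows: nbytes(rows) for rows in (20000, 30000, 100000)}
fits = {rows: size <= INT32_MAX for rows, size in sizes.items()}

for rows in sorted(sizes):
    print(rows, sizes[rows], fits[rows])
```

Only the 20000-row slice (1600000000 bytes) is under the limit of 2147483647. The 30000-row slice is 2400000000 bytes and the full array is 8000000000 bytes.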
However, calling __reduce__ directly works for the full array...
>>> x = test_rand.__reduce__()
>>> type(x)
<type 'tuple'>
>>> x[0]
<built-in function _reconstruct>
>>> x[1]
(<type 'numpy.ndarray'>, (0,), 'b')
>>> x[2][0:3]
(1, (100000, 200, 50), dtype('float64'))
>>> len(x[2][4])
8000000000
>>> x[2][4][:100]
'Y\xa4}\xdf\x84\xdf\xe1?\xfe\x1fd\xe3\xf2\xab\xe2?\x80\xe4\xfe\x17\xfb\xd6\xc2?\xd73\x92\xc9N]\xe8?\x90\xbc\xe3@\xdcO\xc9?\x18\x9dX\x12MG\xc4?(\x0f\x8f\xf9}\xf6\xb1?\xd0\x90O\xe2\x9b\xf1\xed?_\x99\x06\xacY\x9e\xe2?\xe7\xf8\x15\xa8\x13\x91\xe2?\x96}\xffH\xda\xc3\xd4?@\t\xae_"\xe0\xda?y<%\x8a'
And if you'd like to burn out your fan, print x.
What you'll also notice is the function in x[0] gets saved along with the data. It's a self-contained function that can produce a numpy array from the pickled data.
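To make that concrete, here is a small sketch of rebuilding an array by hand from the three pieces that __reduce__ returns. It uses a tiny array rather than the 8 GB one, and calling __setstate__ directly is my illustration of what unpickling does, not something you would normally write:

```python
import numpy as np

arr = np.random.random((4, 3, 2))

# __reduce__ returns (callable, arguments for the callable, state).
func, args, state = arr.__reduce__()

# The callable builds an empty placeholder array...
rebuilt = func(*args)
# ...and the state (version, shape, dtype, order flag, raw data) fills it in.
rebuilt.__setstate__(state)

same = np.array_equal(arr, rebuilt)
print(same)
```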
Upvotes: 9
Reputation: 705
As an alternative to pickle, especially for very large datasets, you may wish to consider a Python interface to a binary data format such as HDF5 (e.g., h5py). For a discussion of its pros and cons, see this question and the first answer.
Upvotes: 2
Reputation: 3877
To answer the first question, "What is actually going on in this error?", here is my guess.
Pickle is trying to save your NumPy array as packed binary data. It's saving each integer as a four-byte signed integer (the i code). However, numpy.random.random creates floats (which should be eight-byte d values rather than four-byte i values). I have no idea why pickle would do it this way. It's also entirely possible that the i actually is for saving some other piece of information than one of the values of your array. I'm just guessing that the error arises because a value of your array does not fit in four bytes.
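One way to test this guess is to try the struct format codes in isolation. A small sketch, assuming that the n packed on line 486 of the traceback is the length of the string being written (8000000000 bytes for the full array) rather than a value from the array:

```python
import struct

# Sizes of the two format codes mentioned above.
int_size = struct.calcsize("<i")     # four-byte signed integer
double_size = struct.calcsize("<d")  # eight-byte float

# A length of 1.6e9 still fits in a signed 32-bit integer.
ok = struct.pack("<i", 1600000000)

# A length of 8e9 (the full array's data) does not.
try:
    struct.pack("<i", 8000000000)
    overflowed = False
except struct.error:
    overflowed = True

print(int_size, double_size, overflowed)
```

If that assumption holds, the array values themselves never go through the i code; only the length of the serialized buffer does.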
What versions of Python and NumPy are you using?
Upvotes: 1