Reputation: 10322
I am trying to cPickle a large scipy sparse matrix for later use. I am getting this error:
File "tfidf_scikit.py", line 44, in <module>
pickle.dump([trainID, trainX, trainY], fout, protocol=-1)
SystemError: error return without exception set
trainX is the large sparse matrix; the other two are lists 6 million elements long.
In [1]: trainX
Out[1]:
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
with 286674296 stored elements in Compressed Sparse Row format>
At this point, Python RAM usage is 4.6GB and I have 16GB of RAM on my laptop.
I think I'm running into a known memory bug in cPickle where it fails on objects that are too big. I also tried marshal, but I don't think it works for scipy matrices. Can someone offer a solution, and preferably an example of how to save and load this?
Python 2.7.5
Mac OS 10.9
Thank you.
Upvotes: 3
Views: 4372
Reputation: 2479
I had this problem with a multi-gigabyte NumPy matrix (Ubuntu 12.04 with Python 2.7.3; it seems to be this issue: https://github.com/numpy/numpy/issues/2396).
I solved it using numpy.savetxt() / numpy.loadtxt(). The matrix is compressed automatically when you add a .gz extension to the file name on saving.
Since I too had just a single matrix I did not investigate the use of HDF5.
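A minimal sketch of that savetxt/loadtxt round trip, assuming a dense NumPy array (the array and file name here are placeholders, not from the original post):

import numpy as np

# Placeholder dense matrix; savetxt writes gzip-compressed text when the
# file name ends in .gz
X = np.random.rand(1000, 50)
np.savetxt("matrix.txt.gz", X)

# loadtxt transparently decompresses .gz files
X_loaded = np.loadtxt("matrix.txt.gz")
assert np.allclose(X, X_loaded)

Note that this only applies to dense arrays; for a sparse matrix you would have to densify it first, which may not fit in memory.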
Upvotes: 1
Reputation: 6192
Neither numpy.savetxt (only for arrays, not sparse matrices) nor sklearn.externals.joblib.dump (pickling; slow as hell and it blew up memory usage) worked for me on Python 2.7. Instead, I used scipy.sparse.save_npz and it worked just fine. Keep in mind that it only works for csc, csr, bsr, dia, or coo matrices.
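A minimal save/load sketch with scipy.sparse.save_npz / load_npz, assuming a CSR matrix standing in for trainX (the shape and density below are made up for illustration):

import numpy as np
import scipy.sparse as sp

# Placeholder CSR matrix standing in for trainX
trainX = sp.random(1000, 500, density=0.01, format="csr", dtype=np.float64)

# save_npz stores the matrix in a compressed .npz file;
# load_npz reads it back in the same sparse format
sp.save_npz("trainX.npz", trainX)
trainX_loaded = sp.load_npz("trainX.npz")

# Verify the round trip: no element differs
assert (trainX != trainX_loaded).nnz == 0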
Upvotes: 0