Reputation: 11508
I am working with large matrices, such as the MovieLens 20M dataset. I restructured the downloaded file so that it matches the dimensions mentioned on the page (138000 by 27000), since the original file uses indices closer to 138000 by 131000 but contains a lot of empty columns. Simply dropping those empty columns and re-indexing yields the desired dimensions.
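For reference, the re-indexing step can be sketched roughly like this (the file names and the assumption that the raw file has userId, movieId, rating, timestamp columns are placeholders, not necessarily the exact layout I used):
import pandas as pd

# rough sketch of the re-indexing; file names and column layout are assumptions
ratings = pd.read_csv("ml-20m/ratings.csv")              # userId, movieId, rating, timestamp
ratings["userId"] = ratings["userId"].factorize()[0]     # contiguous 0-based user indices
ratings["movieId"] = ratings["movieId"].factorize()[0]   # drops the empty movie columns
ratings[["userId", "movieId", "rating"]].to_csv("ml-20m-dense.dat", sep=",", header=False, index=False)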
Anyway, the snippet that casts the sparse CSV file to a dense format looks like this:
import pandas as pd
from scipy import sparse

# note that the file is not the one described in the link, but the smaller, re-indexed one
X = pd.read_csv("ml-20m-dense.dat", sep=",", header=None)   # columns: user index, movie index, rating
mat = sparse.coo_matrix((X[2], (X[0], X[1]))).todense()     # build a sparse COO matrix, then densify it
Now, the estimated size in memory should be close to 138000 * 27000 * 8 / (1024^3) ≈ 27.8 GB.
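As a quick sanity check of that estimate (purely illustrative):
n_users, n_movies = 138000, 27000
print(n_users * n_movies * 8 / 1024**3)   # ≈ 27.8 GiB for float64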
Yet, when I inspect the process with htop, it shows only about 7 GB of resident memory, although approximately 32 GB of virtual memory are reserved.
At first I thought this might be due to some "efficiency trick" by the pandas reader or the scipy.sparse package to avoid blowing up memory consumption. But even after I call my PCA function on the matrix, the resident memory consumption never increases to the amount it should.
Note that calling mat.nbytes returns exactly the estimated amount, so NumPy is at least aware of the size of the data.
(PCA code for reference:)
from fbpca import pca

# top three principal components via fbpca's randomized PCA
result = pca(mat, k=3, raw=False, n_iter=3)
Note that, although fbpca uses a randomized algorithm and I am only computing the top three components, the code still performs a (single, but full) matrix multiplication of the input matrix with a (much smaller) random matrix. Essentially, it still has to access every element of the input matrix at least once, as sketched below.
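Roughly, the range-finding step of such a randomized method looks like this; this is a generic sketch, not fbpca's actual implementation:
import numpy as np

# generic sketch of the range-finding step in randomized PCA/SVD (not fbpca's code):
# multiplying by a small random test matrix still reads every entry of mat once
k, oversample = 3, 10
omega = np.random.randn(mat.shape[1], k + oversample)   # small random test matrix
Y = mat @ omega                                         # one full pass over the 138000 x 27000 matrix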
This also makes my situation slightly different from posts I have found, like this one, since in that post the elements are never actually accessed.
Upvotes: 2
Views: 1598
Reputation: 970
I think your problem lies in the todense() call, which internally uses np.asmatrix(self.toarray(order=order, out=out)). toarray creates its output with np.zeros (see toarray, _process_toarray_args).
So your question can be reduced to: why doesn't np.zeros allocate enough memory?
The answer is probably lazy initialization and zero pages:

Why does numpy.zeros takes up little space
Linux kernel: Role of zero page allocation at paging_init time
So all zero regions of your matrix are actually backed by the same physical zero page, and only writing to the entries will force the OS to allocate enough physical memory.
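A quick way to observe this effect (a rough demo; ru_maxrss is reported in kilobytes on Linux, bytes on macOS):
import resource
import numpy as np

def max_rss_gib():
    # peak resident set size of this process, in GiB (ru_maxrss is KiB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

a = np.zeros((20000, 20000))                      # ~3 GiB of float64 zeros
print("after np.zeros:", max_rss_gib(), "GiB")    # barely increases: pages not backed yet
a[:] = 1.0                                        # write every element
print("after writing: ", max_rss_gib(), "GiB")    # now ~3 GiB of physical memory is used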
Upvotes: 3