Reputation: 1608
I have some very large matrices (let's say on the order of a million rows) that I cannot keep in memory, and I need to access subsamples of these matrices in decent time (less than a minute). I started looking at hdf5 and blaze in combination with numpy and pandas.
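For concreteness, here is a rough sketch of the kind of access pattern I mean, using h5py (the file name, dataset name, and sizes are just placeholders):

    import numpy as np
    import h5py

    # Write the matrix once, chunk by chunk, so it never has to fit in memory.
    with h5py.File("matrix.h5", "w") as f:
        dset = f.create_dataset("matrix", shape=(2_000_000, 100),
                                dtype="float32", chunks=(10_000, 100))
        for start in range(0, 2_000_000, 10_000):
            dset[start:start + 10_000] = np.random.rand(10_000, 100)

    # Later, pull out only the rows I need for a given subsample.
    with h5py.File("matrix.h5", "r") as f:
        rows = np.sort(np.random.choice(2_000_000, size=5_000, replace=False))
        sub = f["matrix"][rows, :]  # fancy indexing reads just those rows from disk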
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
Thanks.
EDIT
Here are some more specifications about the kind of data I am dealing with.
And what I would need to do is:
Upvotes: 4
Views: 168
Reputation: 10759
Your question is lacking a bit in context, but hdf5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays off. That seems like an effective and simple solution; and as far as I know, there are no sparse matrix formats that can easily be read partially from disk.
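As a rough illustration of that approach (the file name, dataset name, and sizes below are made up), chunked, compressed HDF5 storage plus an in-memory sparse cast could look like this with h5py and scipy:

    import h5py
    import numpy as np
    from scipy import sparse

    # Chunked, compressed storage: each chunk is compressed independently,
    # so reading a row range only touches the chunks covering that range.
    with h5py.File("big.h5", "w") as f:
        dset = f.create_dataset("m", shape=(1_000_000, 500), dtype="float32",
                                chunks=(4_096, 500), compression="gzip")
        dset[:10_000] = np.random.rand(10_000, 500)  # fill a bit, just for the demo

    with h5py.File("big.h5", "r") as f:
        block = f["m"][100_000:110_000]           # dense block read from disk
        block_sparse = sparse.csr_matrix(block)   # cast to sparse in memory if it pays off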
Upvotes: 0
Reputation: 1809
Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
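For example, a minimal PyTables sketch along those lines (the file name, shapes, and the Blosc compressor choice are just illustrative):

    import numpy as np
    import tables

    # A CArray is a chunked, compressed on-disk array that supports slicing.
    filters = tables.Filters(complevel=5, complib="blosc")
    with tables.open_file("matrix_pt.h5", "w") as h5:
        arr = h5.create_carray(h5.root, "matrix", tables.Float32Atom(),
                               shape=(1_000_000, 300), filters=filters)
        arr[:10_000] = np.random.rand(10_000, 300).astype("float32")  # partial fill, for the demo

    with tables.open_file("matrix_pt.h5", "r") as h5:
        sub = h5.root.matrix[2_000:3_000, :50]      # only this slice is read from disk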
Upvotes: 1