Reputation: 2438
I am trying to build a very large matrix and am not sure how to do it in a memory-efficient way.
I was using numpy, which worked fine for my smaller case (2750086 x 300). However, I now have a larger one, 2750086 x 1000, which is too big for me to run.
I thought about using an integer dtype, but I will be adding float values to it, so I am unsure how that would affect it.
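For a sense of scale, here is a rough sketch of what a dense array of that shape costs per dtype, and what happens when a float is written into an integer array. The helper name dense_gib is made up for illustration. A dense float64 array of this shape comes to about 20 GiB, and float32 to about half that.

```python
import numpy as np

rows, cols = 2750086, 1000  # shape from the question


def dense_gib(dtype):
    """Size in GiB of a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30


for dtype in (np.float64, np.float32, np.float16, np.int8):
    print(f"{np.dtype(dtype).name:>8}: {dense_gib(dtype):6.1f} GiB")

# Writing a float into an integer array keeps only the integer part,
# so an int dtype is not a safe way to save memory if floats are added later.
ints = np.zeros(3, dtype=np.int8)
ints[0] = 2.7
print(ints[0])
```

If the values tolerate reduced precision, float32 is probably the safer way to halve the footprint.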
I tried to find something about creating a sparse, zero-filled array, but couldn't find any good topics, questions, or suggestions here or elsewhere.
Does anyone have good advice? I am using Python, so I would prefer a Pythonic solution, but I am willing to try other languages.
Thanks.
Edit:
Thanks for the advice. I tried scipy.sparse.csr_matrix, which managed to create the matrix but greatly increased the time needed to fill it.
Here is roughly what I am doing:
matrix = scipy.sparse.csr_matrix((df.shape[0], 300))
## matrix = np.zeros((df.shape[0],
for i, q in enumerate(df['column'].values):
    matrix[i, :] = function(q)
where function is pretty much a vector operation for that row.
If I run the loop on the np.zeros array, it finishes easily, in about 10 minutes.
If I do the same with the scipy sparse matrix, it takes about 50 hours, which is not reasonable.
Any advice?
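A likely reason for the slowdown: an assignment that adds new non-zero entries to a CSR matrix changes its sparsity structure, which is expensive, and SciPy flags it with a SparseEfficiencyWarning. A minimal sketch on a tiny matrix (the shape and values are made up for illustration) that compares the same assignment on CSR and LIL:

```python
import warnings

import numpy as np
from scipy.sparse import SparseEfficiencyWarning, csr_matrix, lil_matrix

csr = csr_matrix((5, 4), dtype=np.float32)

# Record warnings while inserting a new non-zero entry into the CSR matrix.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    csr[1, 1] = 2

csr_warned = any(issubclass(w.category, SparseEfficiencyWarning) for w in caught)
print("CSR assignment warned:", csr_warned)

# The same assignment on a LIL matrix, which is designed for incremental fills.
lil = lil_matrix((5, 4), dtype=np.float32)
lil[1, 1] = 2
print("LIL value:", lil[1, 1])
```

Repeating that kind of insertion for millions of rows is what makes the loop so slow on CSR.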
Edit 2:
scipy.sparse.lil_matrix did the trick.
The loop takes about 20 minutes and uses far less memory than np.zeros.
Thanks.
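A minimal sketch of that pattern: fill a LIL matrix row by row, then convert to CSR once at the end. Here row_vector is a hypothetical stand-in for function, and the small values list stands in for df['column'].values.

```python
import numpy as np
from scipy.sparse import lil_matrix


def row_vector(q, n_cols):
    """Hypothetical stand-in for function(q): a mostly-zero float vector."""
    vec = np.zeros(n_cols, dtype=np.float32)
    vec[q % n_cols] = 0.5 * q
    return vec


values = [3, 7, 12, 25]  # stand-in for df['column'].values
n_cols = 10              # 300 or 1000 in the question

# LIL stores each row as its own list, so assigning row by row is cheap.
matrix = lil_matrix((len(values), n_cols), dtype=np.float32)
for i, q in enumerate(values):
    matrix[i, :] = row_vector(q, n_cols)

# Convert once after filling; CSR is better suited to arithmetic and row slicing.
matrix = matrix.tocsr()
matrix.eliminate_zeros()  # drop any explicitly stored zeros
print(matrix.nnz, "stored values in a", matrix.shape, "matrix")
```

The memory saving only holds if the rows really are mostly zeros. If function returns dense rows, a sparse format will not be smaller than np.zeros.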
Edit 3:
It was still too memory-expensive, so I decided not to store the data in a matrix. Instead I process one row at a time, get the relevant value/metric out of it, store that value in the original df, and repeat.
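Roughly, that approach looks like the sketch below. Both row_vector and metric are hypothetical placeholders (the real function and metric are not shown here), and the three-row DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd


def row_vector(q):
    """Hypothetical stand-in for function(q)."""
    return np.full(4, q, dtype=np.float32)


def metric(vec):
    """Hypothetical per-row summary; here the Euclidean norm."""
    return float(np.linalg.norm(vec))


df = pd.DataFrame({"column": [3, 7, 12]})

# Only one row vector exists at a time; just the scalar result is kept.
df["metric"] = [metric(row_vector(q)) for q in df["column"].values]
print(df)
```

Peak memory is then one row vector plus one extra column in df, instead of the full matrix.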
Upvotes: 1
Views: 2893
Reputation: 19624
import numpy as np
from scipy.sparse import csr_matrix
a = csr_matrix((2750086, 1000), dtype=np.int8)
Then a is:
<2750086x1000 sparse matrix of type '<class 'numpy.int8'>'
with 0 stored elements in Compressed Sparse Row format>
For example, if you do:
import numpy as np
from scipy.sparse import csr_matrix
a = csr_matrix((5, 4), dtype=np.int8).todense()
print(a)
You get:
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
Another option is to use scipy.sparse.lil_matrix:
import scipy.sparse
a = scipy.sparse.lil_matrix((2750086, 1000), dtype='int8')
This seems to be more efficient for setting elements (for example a[1,1] = 2).
Upvotes: 7