epattaro

Reputation: 2438

Memory-efficient way to make a large zeros matrix in Python

I am trying to create a very large matrix, and I am unsure how to do so in a memory-efficient way.

I was using NumPy, which worked fine for my smaller case (2750086 x 300). However, I now have a larger one, 2750086 x 1000, which is too big for me to run.

I thought about making it out of ints, but I will add float values to it, so I am unsure how that could affect it.
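As a rough sanity check on the sizes above, the sketch below estimates how much memory a dense array of that shape would need for a few dtypes, and shows what happens when a float is written into an integer array. The helper name dense_size_gib is made up for illustration; float64 is the default dtype of np.zeros.

```python
import numpy as np

def dense_size_gib(rows, cols, dtype):
    """Size in GiB of a dense rows x cols array with the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30

rows, cols = 2750086, 1000
for dtype in (np.float64, np.float32, np.int8):
    print(np.dtype(dtype).name, round(dense_size_gib(rows, cols, dtype), 1), "GiB")

# Assigning a float into an integer array silently drops the fraction
small = np.zeros(3, dtype=np.int8)
small[0] = 2.7
print(small[0])  # 2
```

With float64 the larger matrix comes to roughly 20 GiB, so switching to float32 halves the footprint, while an integer dtype would lose the fractional part of any float stored in it.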

I tried to find something about making a sparse zero-filled array, but couldn't find any good topics, questions, or suggestions here or elsewhere.

Does anyone have any good advice? I am currently using Python, so I am looking for a Pythonic solution, but I am willing to try other languages.

Thanks.


Edit:

Thanks for the advice. I tried scipy.sparse.csr_matrix, which managed to create the matrix but greatly increased the time needed to fill it.

Here is roughly what I am doing:

matrix = scipy.sparse.csr_matrix((df.shape[0], 300))
## matrix = np.zeros((df.shape[0], 

for i, q in enumerate(df['column'].values):
    matrix[i, :] = function(q)

where function is essentially a vector operation that produces that row.

With np.zeros, the loop runs without trouble and takes about 10 minutes.

With the scipy sparse matrix, the same loop takes about 50 hours, which is not reasonable.

Any advice?


Edit 2:

scipy.sparse.lil_matrix did the trick.

The loop takes about 20 minutes and uses far less memory than np.zeros.

Thanks.


Edit 3:

Still memory-expensive. I decided not to store the data in a matrix. Instead, I process one row at a time, get the relevant value/metric out of it, store that value in the original df, and repeat.
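A minimal sketch of that row-at-a-time approach, assuming pandas and NumPy. Both row_vector (standing in for the function used in the loop above) and row_metric are hypothetical placeholders, since the post does not say what is actually computed.

```python
import numpy as np
import pandas as pd

def row_vector(q, n_cols=1000):
    """Hypothetical stand-in for `function`: builds one row of length n_cols."""
    return np.full(n_cols, float(len(q)))

def row_metric(vec):
    """Hypothetical per-row metric; here simply the Euclidean norm."""
    return float(np.linalg.norm(vec))

df = pd.DataFrame({"column": ["a", "bb", "ccc"]})

# Only one row vector is alive at a time; just the scalar metric is kept
df["metric"] = [row_metric(row_vector(q)) for q in df["column"].values]
print(df)
```

Peak memory is then a single row plus one extra column in df, regardless of how many rows there are.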

Upvotes: 1

Views: 2893

Answers (1)

Miriam Farber

Reputation: 19624

Try scipy.sparse.csr_matrix:

import numpy as np
from scipy.sparse import csr_matrix

# Empty (all-zero) sparse matrix; no elements are stored yet
a = csr_matrix((2750086, 1000), dtype=np.int8)

Then a is

<2750086x1000 sparse matrix of type '<class 'numpy.int8'>'
    with 0 stored elements in Compressed Sparse Row format>

For example, if you do:

import numpy as np
from scipy.sparse import csr_matrix

# Convert a small sparse matrix to dense to see its contents
a = csr_matrix((5, 4), dtype=np.int8).todense()
print(a)

You get:

[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]

Another option is to use scipy.sparse.lil_matrix:

from scipy.sparse import lil_matrix
a = lil_matrix((2750086, 1000), dtype='int8')

LIL is designed for incremental construction, so setting elements (like a[1, 1] = 2) is much cheaper than with CSR.
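To tie this to the loop in the question, here is a sketch of filling a lil_matrix row by row and converting it to CSR once at the end. The names fill_rows and one_hot_row are invented for the example, and float32 is an assumption made because the question stores float values. Sparse storage only saves memory when most entries of each row are zero.

```python
import numpy as np
from scipy.sparse import lil_matrix

def fill_rows(values, row_func, n_cols, dtype=np.float32):
    """Build the matrix in LIL format (cheap row assignment), then convert to CSR."""
    mat = lil_matrix((len(values), n_cols), dtype=dtype)
    for i, q in enumerate(values):
        mat[i, :] = row_func(q)
    return mat.tocsr()  # CSR is better suited to arithmetic and row slicing

def one_hot_row(q, n_cols=4):
    """Hypothetical row function: a mostly-zero vector with one non-zero entry."""
    row = np.zeros(n_cols)
    row[q % n_cols] = q
    return row

m = fill_rows([1, 2, 3], one_hot_row, n_cols=4)
print(m.toarray())
```

Converting only after the loop avoids the repeated structure changes that make element-wise assignment on a csr_matrix slow.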

Upvotes: 7
