Michael Sughrue

Reputation: 211

Memory issues with creating an adjacency matrix using Coo-matrix

Hi, I am trying to generate an adjacency matrix with a dimension of about 24,000 from a CSV with two columns listing pairs of genes and a third column of 1's indicating that an interaction is present. My goal is for the matrix to be square and populated with zeros for combinations not listed in the two columns.

I am using the following Python script:

import numpy as np
from scipy.sparse import coo_matrix

l, c, v = np.loadtxt("biogrid2.csv", dtype=(int), skiprows=0, delimiter=",").T[:3, :]
m =coo_matrix((l, (v-1, c-1)), shape=(v.max(), c.max()))

m.toarray()

It runs OK until it hits the following error:

File "/home/charlie/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

Any ideas about how to get around the memory limit in SciPy?

Thanks

Upvotes: 1

Views: 756

Answers (2)

Daniel F

Reputation: 14399

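Most likely what you want isn't m.toarray() but m.tocsr(). A CSR matrix can do simple linear algebra (like .dot() and matrix powers) natively. Note that tocsr() returns a new matrix rather than converting in place, so assign the result. For instance, this works:

m = m.tocsr()
random_walk_2 = m.dot(m)
random_walk_n = m ** n  # n is a positive integer of your choice; m must be square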
# see https://stackoverflow.com/questions/28702416/matrix-power-for-sparse-matrix-in-python
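As a minimal sketch of the whole path from edge list to sparse arithmetic, the example below builds a square adjacency matrix and multiplies it by itself without ever creating a dense array. The toy edge list and the variable names (rows, cols, vals, adj) are assumptions made for illustration; they are not taken from the question's CSV.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy edge list with 1-based gene IDs (stand-in for the CSV columns).
rows = np.array([1, 2, 3])
cols = np.array([2, 3, 4])
vals = np.ones(len(rows), dtype=np.int32)

# Use the largest ID seen in either column so the matrix is square.
n = int(max(rows.max(), cols.max()))

# Build in COO format, then convert to CSR for arithmetic.
adj = coo_matrix((vals, (rows - 1, cols - 1)), shape=(n, n)).tocsr()

# Sparse matrix product: entry (i, j) counts the 2-step paths from i to j.
two_step = adj.dot(adj)
print(adj.shape, two_step.nnz)
```

Taking the shape from the maximum over both ID columns is one way to guarantee a square matrix even when some genes appear in only one column.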

Covariance should be implementable as well, but I'm not sure what the specific implementation would be without seeing what your current process is.

EDIT: To turn the output back into a simpler format to write out to CSV, you can follow up by converting back to COO with .tocoo(). As with tocsr(), assign the result:

m = m.tocoo()
out = np.c_[m.data, m.row, m.col].T
np.savetxt("foo.csv", out, delimiter=",") 
# see https://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file

Upvotes: 1

PilouPili

Reputation: 2699

The function toarray() converts your 24000*24000 sparse matrix (coo_matrix) into a dense 24000*24000 array. Assuming 4-byte integers, that needs at least

24000*24000*4 bytes = about 2.3 GB (2.15 GiB).

With 8-byte integers, which is what dtype=int gives on most 64-bit Linux builds of NumPy, that figure doubles.
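The estimate can be checked with a few lines of arithmetic. This is a small sketch; the helper name dense_nbytes is hypothetical and exists only for this illustration.

```python
import numpy as np

def dense_nbytes(n, dtype):
    """Bytes needed for a dense n x n array of the given dtype."""
    return n * n * np.dtype(dtype).itemsize

n = 24000
for dtype in (np.int32, np.int64):
    nbytes = dense_nbytes(n, dtype)
    # Report both decimal gigabytes and binary gibibytes.
    print(np.dtype(dtype).name, round(nbytes / 1e9, 2), "GB,",
          round(nbytes / 2**30, 2), "GiB")
```

By contrast, a sparse matrix only stores the nonzero entries plus their indices, so its footprint scales with the number of interactions rather than with n squared.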

To avoid using so much memory, don't convert to a dense matrix with toarray(); do your operations on the sparse matrix instead.

If you need the matrix product of m with itself, you can just do m*m or m.dot(m) and you will get a sparse matrix. Note that m.multiply(m) is the element-wise product, not the matrix product.
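To make the difference between the two products concrete, here is a small sketch on a 2x2 matrix. The matrix a is a made-up example, not data from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix(np.array([[0, 1],
                         [1, 1]]))

# Matrix product (rows times columns); the result stays sparse.
matrix_product = a.dot(a)

# Element-wise product; each entry is multiplied by itself.
elementwise = a.multiply(a)

print(matrix_product.toarray())
print(elementwise.toarray())
```

Calling toarray() is harmless here because the example is tiny; on the 24000*24000 matrix you would keep both results sparse.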

To save your matrix you have several options.

The simplest one is NPZ; see https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.save_npz.html or the Stack Overflow question "Save / load scipy sparse csr_matrix in portable data format".
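A minimal round trip with save_npz and load_npz might look like the sketch below. The file name adjacency.npz and the temporary directory are assumptions for the example; use whatever path suits you.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Small stand-in for the real adjacency matrix.
m = csr_matrix(np.array([[0, 1, 0],
                         [1, 0, 0],
                         [0, 0, 0]]))

# Hypothetical output location for this illustration.
path = os.path.join(tempfile.mkdtemp(), "adjacency.npz")

save_npz(path, m)         # writes the sparse structure, not a dense array
restored = load_npz(path)  # comes back in the same sparse format

print(restored.shape, restored.nnz)
```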

If you want your result in the same form as your initial CSV file, coo_matrix has the following attributes, which can be used to create the CSV file:

data: COO format data array of the matrix
row: COO format row index array of the matrix
col: COO format column index array of the matrix

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html
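As a sketch of that last step, the three attributes can be stacked into one row per nonzero entry and written out as text. The column order (row, col, data) and the shift back to 1-based IDs are assumptions about the desired layout; adjust them to match your original file. An in-memory buffer stands in for the output file here.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

# Small stand-in matrix with two nonzero entries.
m = coo_matrix(([1, 1], ([0, 2], [1, 0])), shape=(3, 3))

# One line per nonzero entry: row ID, column ID, value (IDs shifted to 1-based).
triples = np.column_stack((m.row + 1, m.col + 1, m.data))

buf = io.StringIO()  # replace with a file name to write to disk
np.savetxt(buf, triples, fmt="%d", delimiter=",")
csv_text = buf.getvalue()
print(csv_text)
```

If the matrix is currently in CSR or another format, call tocoo() first, since row and col are attributes of the COO format.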

Upvotes: 0
