bvilhjal

Reputation: 136

What is the most efficient way to calculate a huge sparse diagonal correlation matrix?

I have a huge correlation matrix that I need to calculate, say 200000x200000, which is too large to store in memory. Luckily, most of the values are 0; only the values close to the diagonal of the matrix actually need to be calculated. Hence, I'm curious whether sparse matrices in scipy/numpy could help me speed things up.

The current way that I construct the data is as follows.

import scipy as sp

# Input variables are snps (an M x N matrix) and max_dist
def calc_corr_table(snps, max_dist):
    num_snps, num_indivs = snps.shape
    # nested dict: corr_table[i][k] holds the correlation between SNPs i and k
    corr_table = {}
    for i in range(num_snps):
        corr_table[i] = {}

    for i in range(0, num_snps - 1):
        start_i = i + 1
        end_i = min(start_i + max_dist, num_snps)
        # correlations between SNP i and the next (up to) max_dist SNPs
        corr_vec = sp.dot(snps[i], sp.transpose(snps[start_i:end_i])) / float(num_indivs)
        corr_vec = sp.array(corr_vec).flatten()
        for k in range(start_i, end_i):
            corr_vec_i = k - start_i
            corr_table[i][k] = corr_vec[corr_vec_i]
            corr_table[k][i] = corr_vec[corr_vec_i]
    return corr_table

Here snps is an MxN matrix with standardised row-vectors (mean 0 and variance 1), for which I'd like to calculate the MxM correlation matrix. Currently the correlation matrix is stored as a huge dictionary (corr_table). max_dist denotes the maximum distance between a pair of SNPs (rows in the snps matrix) for which I calculate the correlation; all other correlations (those not in corr_table) are assumed to be 0.
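For context, here is a minimal sketch (with made-up toy data, not my real genotypes) of what I mean by standardised row-vectors:

import numpy as np

# toy stand-in for the real data: 4 SNPs x 10 individuals
rng = np.random.default_rng(0)
snps = rng.normal(size=(4, 10))

# standardise each row to mean 0 and variance 1, so that
# dot(snps[a], snps[b]) / num_indivs is the Pearson correlation
snps = (snps - snps.mean(axis=1, keepdims=True)) / snps.std(axis=1, keepdims=True)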

Unfortunately, this is still not very efficient in practice. Hence, I'd like to know whether I can use matrix multiplications together with sparse matrices to calculate the correlation matrix more efficiently, without using much more memory.

Any suggestions would be greatly appreciated.

Upvotes: 0

Views: 917

Answers (1)

hpaulj

Reputation: 231335

I haven't tried to understand or run your calculation, but I can add some pointers about sparse matrices.

There are half a dozen different sparse formats in scipy.sparse, each with different strengths and weaknesses. It is easy to convert one format to another; the sparse functions do that frequently. None are very good for incremental updating.

The dok format is actually a dictionary subclass. The keys are coordinate tuples, e.g. corr_table[(i,k)] = corr_vec[corr_vec_i]. I found in other SO questions that it is faster to build a plain dictionary with these keys, and then update the dok from that. There's more overhead in the corr_matrix[i,k]=... indexing.
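A minimal sketch of that two-step pattern, with a tiny made-up size:

from scipy import sparse

n = 5                                    # tiny size for illustration
# plain dict keyed by (row, col) tuples; cheap to fill incrementally
plain = {(i, i + 1): 0.5 for i in range(n - 1)}

# filling a dok_matrix element by element goes through sparse's
# __setitem__ checks, which is where the extra overhead comes from
D = sparse.dok_matrix((n, n))
for (i, k), v in plain.items():
    D[i, k] = v
print(D.toarray())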

lil is also relatively good for incremental changes. It stores values in two lists of lists, with one sublist per row of the matrix.
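A tiny sketch showing those two lists of lists (made-up values):

from scipy import sparse

L = sparse.lil_matrix((4, 4))
L[0, 1] = 0.5
L[2, 3] = 0.3
print(L.rows)   # per-row lists of column indices
print(L.data)   # per-row lists of the stored values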

csr is good for matrix calculations, but slow for indexed assignment. It is best built with the coo style of input, which uses three 1d arrays: data, row, col. Traditionally sparse matrices have been built by constructing these three arrays, possibly as lists if doing it incrementally, and then passing them to coo_matrix.
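A minimal sketch of that coo-style construction, with made-up values:

import numpy as np
from scipy import sparse

# three parallel 1d arrays: the values and their (row, col) coordinates
data = np.array([0.9, 0.9, 0.7])
row  = np.array([0, 1, 1])
col  = np.array([1, 0, 2])

M = sparse.coo_matrix((data, (row, col)), shape=(3, 3))
M_csr = M.tocsr()        # convert once, then do the fast matrix products in csr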

There are block and diagonal formats as well, which may be better for some sparsity layouts.
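Since your nonzeros hug the diagonal, a diagonal (banded) format might suit this problem; here is a small sketch using sparse.diags with made-up band values:

import numpy as np
from scipy import sparse

main = np.ones(5)          # main diagonal
off  = 0.5 * np.ones(4)    # first off-diagonals
B = sparse.diags([off, main, off], offsets=[-1, 0, 1])
print(B.toarray())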

I suspect this step:

for k in range(start_i, end_i):
    corr_vec_i = k - start_i
    corr_table[i][k] = corr_vec[corr_vec_i]
    corr_table[k][i] = corr_vec[corr_vec_i]

can be performed as a numpy vector operation, something like

jj = np.arange(start_i, end_i)          # column indices k
vals = corr_vec[jj - start_i]           # the corresponding correlations
# entries (i, k) ...
data_list.append(vals)
row_list.append(np.full(len(jj), i))
col_list.append(jj)
# ... and their symmetric counterparts (k, i)
data_list.append(vals)
row_list.append(jj)
col_list.append(np.full(len(jj), i))

where data_list etc. are lists in which I am collecting the inputs for coo. They will probably have to be passed through np.concatenate to produce the 1d arrays that sparse.coo_matrix can use.
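A minimal sketch of that final assembly step, with toy stand-ins for the collected lists:

import numpy as np
from scipy import sparse

num_snps = 4
# toy stand-ins for the lists collected in the loop above
data_list = [np.array([0.8, 0.8]), np.array([0.6, 0.6])]
row_list  = [np.array([0, 1]),     np.array([1, 2])]
col_list  = [np.array([1, 0]),     np.array([2, 1])]

data = np.concatenate(data_list)
rows = np.concatenate(row_list)
cols = np.concatenate(col_list)
corr_mat = sparse.coo_matrix((data, (rows, cols)), shape=(num_snps, num_snps))
corr_csr = corr_mat.tocsr()   # csr for the subsequent matrix arithmetic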

I haven't tested this code, so there may be bugs, but hopefully it gives you some ideas to start with.

Upvotes: 1
