bvilhjal

Reputation: 136

What is the most efficient way to calculate a huge sparse diagonal correlation matrix?

I have a huge correlation matrix that I need to calculate, say 200000x200000, which is too large to store in memory. Luckily, most of the values are 0; only the values close to the diagonal of the matrix actually need to be calculated. Hence, I'm curious whether sparse matrices in scipy/numpy could help me speed things up.

The current way that I construct the data is as follows.

import scipy as sp

# Input variables are snps (an M x N matrix) and max_dist
def calc_corr_table(snps, max_dist):
    num_snps, num_indivs = snps.shape
    # nested dict: corr_table[i][k] holds the correlation between SNPs i and k
    corr_table = {}
    for i in range(num_snps):
        corr_table[i] = {}

    for i in range(0, num_snps - 1):
        start_i = i + 1
        end_i = min(start_i + max_dist, num_snps)
        # correlations between SNP i and the next (up to) max_dist SNPs
        corr_vec = sp.dot(snps[i], sp.transpose(snps[start_i:end_i])) / float(num_indivs)
        corr_vec = sp.array(corr_vec).flatten()
        for k in range(start_i, end_i):
            corr_vec_i = k - start_i
            corr_table[i][k] = corr_vec[corr_vec_i]
            corr_table[k][i] = corr_vec[corr_vec_i]
    return corr_table

Here snps is an MxN matrix with standardised row-vectors (mean 0 and variance 1), for which I'd like to calculate the MxM correlation matrix. Currently the correlation matrix is stored as a huge dictionary (corr_table). max_dist denotes the maximum distance between a pair of SNPs (rows in the snps matrix) for which I calculate the correlation; all other correlations (those not in corr_table) are assumed to be 0.
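For context, here is a minimal sketch (with made-up toy data, not my real genotypes) of what I mean by standardised row-vectors:

import numpy as np

# toy stand-in for the real data: 4 SNPs x 10 individuals
rng = np.random.default_rng(0)
snps = rng.normal(size=(4, 10))

# standardise each row to mean 0 and variance 1, so that
# dot(snps[a], snps[b]) / num_indivs is the Pearson correlation
snps = (snps - snps.mean(axis=1, keepdims=True)) / snps.std(axis=1, keepdims=True)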

Unfortunately, this is still not very efficient in practice. Hence, I'd like to know whether I can use matrix multiplications together with sparse matrices to calculate the correlation matrix more efficiently, without using much more memory.

Any suggestions would be greatly appreciated.

Upvotes: 0

Views: 917

Answers (1)

hpaulj

Reputation: 231335

I haven't tried to understand or run your calculation, but I can add some pointers about sparse matrices.

There are half a dozen different sparse formats in scipy.sparse, each with different strengths and weaknesses. It is easy to convert one format to another; the sparse functions do that frequently. None are very good for incremental updating.

The dok format is actually a dictionary subclass. The keys are coordinate tuples, e.g. corr_table[(i,k)] = corr_vec[corr_vec_i]. I found in other SO questions that it is faster to build a plain dictionary with these keys, and then update the dok from that. There's more overhead in the corr_matrix[i,k]=... indexing.
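A minimal sketch of that two-step pattern, with a tiny made-up size:

from scipy import sparse

n = 5                                    # tiny size for illustration
# plain dict keyed by (row, col) tuples; cheap to fill incrementally
plain = {(i, i + 1): 0.5 for i in range(n - 1)}

# filling a dok_matrix element by element goes through sparse's
# __setitem__ checks, which is where the extra overhead comes from
D = sparse.dok_matrix((n, n))
for (i, k), v in plain.items():
    D[i, k] = v
print(D.toarray())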

lil is also relatively good for incremental changes. It stores values in two lists of lists, with one sublist per row of the matrix.
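A tiny sketch showing those two lists of lists (made-up values):

from scipy import sparse

L = sparse.lil_matrix((4, 4))
L[0, 1] = 0.5
L[2, 3] = 0.3
print(L.rows)   # per-row lists of column indices
print(L.data)   # per-row lists of the stored values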

csr is good for matrix calculations, but slow for indexed assignment. It is best built with the coo style of input, which uses three 1d arrays: data, row, col. Traditionally sparse matrices have been built by constructing these three arrays, possibly as lists if doing it incrementally, and then passing them to coo_matrix.
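A minimal sketch of that coo-style construction, with made-up values:

import numpy as np
from scipy import sparse

# three parallel 1d arrays: the values and their (row, col) coordinates
data = np.array([0.9, 0.9, 0.7])
row  = np.array([0, 1, 1])
col  = np.array([1, 0, 2])

M = sparse.coo_matrix((data, (row, col)), shape=(3, 3))
M_csr = M.tocsr()        # convert once, then do the fast matrix products in csr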

There are block and diagonal formats as well, which may be better for some sparsity layouts.
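Since your nonzeros hug the diagonal, a diagonal (banded) format might suit this problem; here is a small sketch using sparse.diags with made-up band values:

import numpy as np
from scipy import sparse

main = np.ones(5)          # main diagonal
off  = 0.5 * np.ones(4)    # first off-diagonals
B = sparse.diags([off, main, off], offsets=[-1, 0, 1])
print(B.toarray())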

I suspect this step:

for k in range(start_i, end_i):
    corr_vec_i = k - start_i
    corr_table[i][k] = corr_vec[corr_vec_i]
    corr_table[k][i] = corr_vec[corr_vec_i]

can be performed as a numpy vector operation, something like

jj = np.arange(start_i, end_i)          # column indices k
vals = corr_vec[jj - start_i]           # the corresponding correlations
# entries (i, k) ...
data_list.append(vals)
row_list.append(np.full(len(jj), i))
col_list.append(jj)
# ... and their symmetric counterparts (k, i)
data_list.append(vals)
row_list.append(jj)
col_list.append(np.full(len(jj), i))

where data_list etc. are lists in which I am collecting the inputs for coo. They will probably have to be passed through np.concatenate to produce the 1d arrays that sparse.coo_matrix can use.
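A minimal sketch of that final assembly step, with toy stand-ins for the collected lists:

import numpy as np
from scipy import sparse

num_snps = 4
# toy stand-ins for the lists collected in the loop above
data_list = [np.array([0.8, 0.8]), np.array([0.6, 0.6])]
row_list  = [np.array([0, 1]),     np.array([1, 2])]
col_list  = [np.array([1, 0]),     np.array([2, 1])]

data = np.concatenate(data_list)
rows = np.concatenate(row_list)
cols = np.concatenate(col_list)
corr_mat = sparse.coo_matrix((data, (rows, cols)), shape=(num_snps, num_snps))
corr_csr = corr_mat.tocsr()   # csr for the subsequent matrix arithmetic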

I haven't tested this code, so there may be bugs, but hopefully it gives you some ideas to start with.

Upvotes: 1
