DrTchocky

Reputation: 535

Reducing calculation time and requirements for large covariance matrix

I am currently trying to calculate a covariance matrix for a ~30k-row matrix (all values are in the range [0, 1]), and it's taking a very long time (I have let it run for over an hour and it still hasn't completed).

One thing I noticed on smaller examples (a 7k-row matrix) is that the output values have a ridiculous number of significant digits (e.g. ~10^32), which may be slowing things down (and increasing the file size). Is there any way to limit this?
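For the file-size part, pandas can round values at write time. A minimal sketch, assuming the cov DataFrame built below; float_format is a standard parameter of DataFrame.to_csv:

# write with 6 decimal places instead of full float precision
cov.to_csv('/gemnetics/cov_matrix.csv', float_format='%.6f')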

I've been using NumPy's covariance function on a simple DataFrame:

import numpy as np
import pandas as pd

df = pd.read_csv('gene_data/genetic_data25.csv')
df = df.set_index('ID_REF')

# min-max scale each column to [0, 1]
df = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))

# covariance of the rows (np.cov treats each row as a variable)
cov = np.cov(df)

cov = pd.DataFrame(cov)
cov.to_csv('/gemnetics/cov_matrix.csv')
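For scale: np.cov with its default rowvar=True treats each row as a variable, so a ~30k-row input produces a ~30k x 30k float64 result. A quick back-of-the-envelope memory check:

n_vars = 30_000
bytes_needed = n_vars * n_vars * 8          # float64 entries, 8 bytes each
print(f"{bytes_needed / 1e9:.1f} GB")       # ~7.2 GB before the CSV is even written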

Upvotes: 1

Views: 422

Answers (1)

anishtain4

Reputation: 2402

Since I'm not familiar with genetics, I'll give you general guidelines and hope they work. Let's assume you have your data in a matrix called X, which is 30+k by 1k. You don't really need to normalize your data (unless the scaling matters to you), but to calculate the covariance you do have to center it. Then you can calculate the right eigenvectors:

import numpy as np

Xp = X - X.mean(axis=0, keepdims=True)   # center each column
k = Xp.T @ Xp                            # 1k x 1k Gram matrix, cheap to decompose
ev, R = np.linalg.eigh(k)                # eigh returns eigenvalues in ascending order
ev = ev[::-1]                            # flip to descending
R = R[:, ::-1]

At this point you should look at the eigenvalues to see if there is an abrupt drop in them (this is the scree test); call the index where it occurs the cut-off number n. If there is no clear drop, you just have to choose what fraction of the total variance you want to keep (see the sketch after the next snippet). The next step is to reconstruct the left eigenvectors:

L = Xp @ R[:, :n]   # project the centered data onto the top-n right eigenvectors
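When there is no clear elbow, one common way to pick n is an explained-variance threshold. A minimal sketch (my own addition, not part of the original answer; the 95% cut-off is an arbitrary assumption):

# fraction of total variance explained by the first i components
frac = np.cumsum(ev) / ev.sum()
# smallest n that reaches the 95% threshold
n = int(np.searchsorted(frac, 0.95)) + 1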

Now R.T tells you which combinations of the original columns are important, and the left eigenvectors (the columns of L) are the most prominent combinations of your genes. I hope this helps.
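As a sanity check (again my own sketch, not from the answer), L and R[:, :n] together give a rank-n approximation of the centered data, which is what lets you avoid ever forming the 30k x 30k covariance:

# rank-n reconstruction of the centered data from the kept components
Xp_approx = L @ R[:, :n].T
rel_err = np.linalg.norm(Xp - Xp_approx) / np.linalg.norm(Xp)
print(f"relative reconstruction error: {rel_err:.2e}")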

Upvotes: 2
