user2855672

Reputation: 193

Correlation coefficients for sparse matrix in python?

Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.
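To illustrate the scale problem (a sketch with made-up sizes): the obvious route densifies the matrix first, which is exactly what I want to avoid:

import numpy as np
from scipy import sparse

A = sparse.rand(100, 10000, density=0.01, format='csr')
# Works for small matrices, but .toarray() materializes the full
# dense array in memory, which does not scale:
C = np.corrcoef(A.toarray())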

Upvotes: 17

Views: 14429

Answers (4)

Jake Drew

Reputation: 2340

I present an answer for a scipy sparse matrix that runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of columns to keep after checking all columns for both positive and negative Pearson correlations.

I also try to minimize calculations using the following strategy:

  • Process each column
  • Start at the current column + 1 and calculate correlations moving to the right.
  • For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
  • Perform these steps for each column in the dataset except the last.

This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in Python to implement this tonight.
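For what it's worth, one untested way that idea might look is a multiprocessing.Manager dict used as a shared skip list (get_corr_row_shared is a hypothetical variant of the function below):

from multiprocessing import Manager
from scipy.stats import pearsonr

# Hypothetical variant: consult a process-safe shared dict of columns
# already marked for removal, so workers can skip them.
def get_corr_row_shared(idx_num, sp_mat, thresh, removed):
    x = sp_mat[:, idx_num].toarray().ravel()
    for i in range(idx_num + 1, sp_mat.shape[1]):
        if i in removed:
            # Column i is already marked for removal by another worker;
            # no need to correlate against it
            continue
        corr, p_val = pearsonr(x, sp_mat[:, i].toarray().ravel())
        if abs(corr) >= thresh:
            removed[idx_num] = True  # mark this column for removal
            return False
    return True

# removed = Manager().dict() would be created once and passed to every worker

Because workers race on the shared dict, the kept set can vary slightly between runs; the dict is only a best-effort skip list, not a synchronization point.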

Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.

Check each column using this function:

from scipy.stats import pearsonr

def get_corr_row(idx_num, sp_mat, thresh):
    # Slice the column at idx_num
    cols = sp_mat.shape[1]
    x = sp_mat[:, idx_num].toarray().ravel()
    start = idx_num + 1

    # Now slice each column to the right of idx_num
    for i in range(start, cols):
        y = sp_mat[:, i].toarray().ravel()
        # Check the Pearson correlation
        corr, p_val = pearsonr(x, y)
        # Pearson ranges from -1 to 1, so abs(corr) catches both
        # positive and negative correlations >= thresh
        if abs(corr) >= thresh:
            # Mark the column at idx_num for removal in the mask and
            # stop checking after finding the first correlation >= thresh
            return False
    return True

Run the column-level correlation checks in parallel:

from joblib import Parallel, delayed
import multiprocessing


def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
    # We must make sure the matrix is in CSC format
    # before we start doing all these column slices!
    sp_mat = sp_mat.tocsc()
    cols = sp_mat.shape[1]

    if n_jobs == -1:
        # Process the work on all available CPU cores
        num_cores = multiprocessing.cpu_count()
    else:
        # Process the work on the specified number of CPU cores
        num_cores = n_jobs

    # Return a mask of all columns to keep by calling get_corr_row()
    # once for each column in the matrix
    return Parallel(n_jobs=num_cores, verbose=5)(
        delayed(get_corr_row)(i, sp_mat, thresh) for i in range(cols))

General Usage:

# Get the mask using your sparse matrix and threshold
corr_mask = Get_Corr_Mask(X_t_fpr, 0.95)

# Remove features that are >= 95% correlated
X_t_fpr_corr = X_t_fpr[:, corr_mask]

Upvotes: 1

Alt

Reputation: 2755

You do not need to introduce a large dense matrix. Just keep the computation sparse, densifying only the small result with NumPy:

import numpy as np

def sparse_corr(A):
    # A has observations in rows and features in columns
    N = A.shape[0]
    # Covariance of the columns, computed entirely with sparse operations;
    # only the small (n_features x n_features) result is densified
    C = ((A.T * A - (sum(A).T * sum(A) / N)) / (N - 1)).todense()
    # Outer product of the column standard deviations
    V = np.sqrt(np.mat(np.diag(C)).T * np.mat(np.diag(C)))
    # The tiny constant guards against division by zero
    COR = np.divide(C, V + 1e-119)
    return COR

Testing the performance:

from scipy import sparse

A = sparse.rand(1000000, 100, density=0.1, format='csr')
sparse_corr(A)
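As a quick sanity check, the result agrees with numpy.corrcoef on a small matrix; note that sparse_corr correlates columns, so rowvar=False is needed on the dense side:

import numpy as np
from scipy import sparse

B = sparse.rand(500, 10, density=0.3, format='csr')
print(np.allclose(sparse_corr(B), np.corrcoef(B.todense(), rowvar=False)))
# True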

Upvotes: 12

Amin

Reputation: 1034

Unfortunately, Alt's answer didn't work out for me. The values given to the np.sqrt function were mostly negative, so the resulting correlation values were nan.

I wasn't able to use ali_m's answer either, because my matrix was so large that I couldn't fit the centering = rowsum.dot(rowsum.T.conjugate()) / n matrix in memory (my matrix's dimensions are 3.5*10^6 x 33).

Instead, I used scikit-learn's StandardScaler to compute a standardized sparse matrix and then obtained the correlation matrix with a single multiplication.

from sklearn.preprocessing import StandardScaler

def compute_sparse_correlation_matrix(A):
    # with_mean=False skips centering, which would otherwise densify the matrix
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)  # Assuming A is a CSR or CSC matrix
    corr_matrix = (1 / scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
    return corr_matrix
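A minimal usage sketch (with an arbitrary random matrix standing in for real data):

from scipy import sparse

# A tall-and-thin sparse matrix, shaped like the 3.5*10^6 x 33 case above
A = sparse.random(100000, 33, density=0.05, format='csr')
corr = compute_sparse_correlation_matrix(A)
print(corr.shape)  # (33, 33), and still a sparse matrix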

I believe this approach is faster and more robust than the other approaches mentioned here. Moreover, it preserves the sparsity pattern of the input matrix.

Upvotes: 0

ali_m

Reputation: 74232

You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:

import numpy as np
from scipy import sparse

def sparse_corrcoef(A, B=None):

    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i} * C_{j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs

Check that it works OK:

# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')

coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())

print(np.allclose(coeffs1, coeffs2))
# True

Be warned:

The amount of memory required for computing the covariance matrix C will be heavily dependent on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values then C will be an (m, m) matrix containing all non-zero values. If m is large then this could be very bad news in terms of memory consumption.
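To make that concrete, here is a small illustration (sizes are arbitrary):

from scipy import sparse

m, n = 1000, 100000
A = sparse.lil_matrix((m, n))
A[:, 0] = 1.0                  # a single column of non-zero values
A = A.tocsr()

G = A.dot(A.T.conjugate())     # the Gram term inside sparse_corrcoef
print(G.nnz == m * m)          # True: all m*m entries are stored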

Upvotes: 14
