Mohammad
Mohammad

Reputation: 1013

How to calculate correlation between 1D numpy array and every column of a 2D numpy array

I have a 1D numpy array (y) and 2D numpy array (x) and I calculate correlation between y and every column in x as below:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=42)

x = rng.random((3, 3))
y = rng.random(3)

for i in range(x.shape[1]):
    print( pearsonr(x[:, i], y)[0]  )

I was wondering how I can get the correlation values without For loop. Is there any way?

Upvotes: 1

Views: 2312

Answers (2)

Filipe Maia
Filipe Maia

Reputation: 76

Here's a numpy native solution without for loops:

def vector_corr_np(data1, data2):
    data1 = np.atleast_2d(data1)
    data2 = np.atleast_2d(data2)
    mean1 = data1.mean(axis=1) 
    mean2 = data2.mean(axis=1)
    std1 = data1.std(axis=1)
    std2 = data2.std(axis=1)
    corr = ((data1*data2).mean(axis=1)-mean1*mean2)/(std1*std2)
    return corr
import numpy as np

rng = np.random.default_rng(seed=42)
n = 20000
x = rng.random((n, n))
y = rng.random(n)
%timeit vector_corr_np(x, y)
5.46 s ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 1

Massifox
Massifox

Reputation: 4487

I propose these approaches, all of which lead to the same result as your proposed solution:

  • Approach 1: A solution similar to the one proposed by Lucas M. Uriarte, using numpy.corrcoef:

    np.corrcoef(y,x.T)[0][1:]
    
  • Approach 2: The function for calculating the correlation is rewritten using numpy functions:

    def corr_np(data1, data2):
        mean1 = data1.mean() 
        mean2 = data2.mean()
        std1 = data1.std()
        std2 = data2.std()
        corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
        return corr
    
    def paerson_np(x, y):
        return np.array([corr_np(x[:, i], y) for i in range(x.shape[1])])
    
  • Approach 3: The function for calculating the correlation is rewritten, using numba to speed up the calculations:

    @nb.njit()
    def corr_nb(data1, data2):
        M = data1.size
        sum1 = 0.
        sum2 = 0.
        for i in range(M):
            sum1 += data1[i]
            sum2 += data2[i]
        mean1 = sum1 / M
        mean2 = sum2 / M
    
        var_sum1 = 0.
        var_sum2 = 0.
        cross_sum = 0.
        for i in range(M):
            var_sum1 += (data1[i] - mean1) ** 2
            var_sum2 += (data2[i] - mean2) ** 2
            cross_sum += (data1[i] * data2[i])
    
       std1 = (var_sum1 / M) ** .5
       std2 = (var_sum2 / M) ** .5
       cross_mean = cross_sum / M
    
       return (cross_mean - mean1 * mean2) / (std1 * std2)
    
    @nb.njit()
    def paerson_nb(x, y):
        return np.array([corr_nb(x[:, i], y) for i in range(x.shape[1])])
    

Execution time comparison

I experimented to see which solution was more efficient, comparing the 3 approaches I listed above and your solution (which I will call approach 0). The instances for the experiments have the following structure:

import numpy as np
import numba as nb
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=42)
n = 20000
x = rng.random((n, n))
y = rng.random(n)

Results:

  • Approch 0 (your solution):

    %timeit approach0(x, y) :-> 15.6 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  • Approach 1:

    %timeit np.corrcoef(y,x.T)[0][1:] :-> 37.4 s ± 3.68 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  • Approach 2:

    %timeit paerson_np(x, y) :-> 19.1 s ± 351 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  • Approach 3:

    %timeit paerson_nb(x, y) :-> 7.81 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

The solution with numba(approch 3), is about 2 times faster than your solution(approach 0) and the solution with numpy(approach 2). The solution with numpy.corrcoef is clearly the slowest: about 2 times slower than aprroaches 0 and 2, and even more than 5 times slower than the solution with numba.

Upvotes: 3

Related Questions