Reputation: 294238

Quickest way to calculate subset of correlation matrix

I'm partial to using pandas builtin corr method for dataframes. However, I am trying to calculate the correlation matrix of a dataframe with 45,000 columns. And then repeat this 250 times. The calculation is crushing my ram (16 GB, mac book pro). I'm grabbing statistics on the columns of the resulting correlation matrix. So I need one column's correlation with every other column to calculate those statistics. My solution is to calculate correlation of a subset of columns with every other column, but I need an efficient way to do this.

Consider:

import pandas as pd
import numpy as np

np.random.seed([3,1415])

df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
df

I want to calculate the correlations for just ['A', 'B']

corrs = df.corr()[['A', 'B']]
corrs

I'll finish it off by calculating the mean or some other stat.

I can't use the code I used to create the example because when I scale up, I don't have the memory for it. When performing the calculation, it must use an amount of memory proportional to the number of columns chosen to calculate correlations relative to everything else.

I'm looking for the most performant solution at scale. I have a solution, but I'm looking for other ideas to ensure I'm getting the best. Any answer provided that returns the correct answer as shown in the demonstration and satisfies the memory constraint will be upvoted by me (and I'd encourage upvoting amongst each other as well).

Below is my code:

def corr(df, k=0, l=10):
    d = df.values - df.values.mean(0)
    d_ = d[:, k:l]
    s = d.std(0, keepdims=True)
    return pd.DataFrame(d.T.dot(d[:, k:l]) / s.T.dot(s[:, k:l]) / d.shape[0],
                        df.columns, df.columns[k:l])

Upvotes: 5

Answers (3)

Jeremy Feng

Reputation: 365

Further improvement to @user20160 's answer: https://stackoverflow.com/a/38195754/

I find np.dot is about 2 times faster than np.outer.

import numpy as np


x = np.random.rand(20, 50)

%%timeit
xsum = np.sum(x, 0)
np.outer(xsum, xsum[0:3])
# 11.1 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit
xsum = np.sum(x, 0, keepdims=True)
np.dot(xsum.T, xsum[:, 0:3])
# 6.44 µs ± 49.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

So the code can be further improved to:

import numpy as np


x = np.random.rand(20, 50)


def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0, keepdims=True)
    xstd = np.std(x, 0, keepdims=True)

    return lambda lo, hi: (
        (np.dot(x.T, x[:, lo:hi]) - np.dot(xsum.T, xsum[:, lo:hi])/n)
        / (np.dot(xstd.T, xstd[:, lo:hi])*n)
    )

# construct function to compute columns of correlation matrix
cc = col_correlator(x)

# compute columns of correlation matrix for dimensions 1 thru 3
r = cc(1, 4)

Upvotes: 1

Ofir Shifman

Reputation: 81

Credit goes to @user20160, @piRsquared.

I had a very similar issue. I was trying to compute just a quarter of the matrix: the correlations between some group of columns to another one.

I modified the code a bit, that it takes 4 parameters, for the 2 groups of vectors:

def col_correlator(x):
n = x.shape[0]
xsum = np.sum(x, 0)
xstd = np.std(x, 0)

return lambda lo_c, hi_c, lo_r, hi_r: (
    (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c]) - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)
    / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n)
)

# construct function to compute columns of correlation matrix
cc = col_correlator(x)

# compute columns of correlation matrix for dimensions 1 thru 3
r = cc(n, m,0,n)

Upvotes: 1

user20160

Reputation: 1394

Using dot products to compute the correlation (as in your example) seems like a good approach. I'll describe two improvements, then code implementing them.

Improvement 1: Pull means out of dot product

We can pull the means out of the dot product, to avoid having to subtract them from every value (similar to how you pulled the standard deviations out of the dot product, which we'll also do).

Let x, y be vectors with n elements. Let a, b be scalars. Let <x,y> denote the dot product between x and y.

The correlation between x and y can be expressed using the dot product

<(x-mean(x))/std(x), (y-mean(y))/std(y)> / n

To pull the standard deviations out of the dot product, we can use the following identity (as you did above):

<ax, by> = a*b*<x, y>

To pull the means out of the dot product, we can derive another identity:

<x+a, y+b> = <x,y> + a*sum(y) + b*sum(x) + a*b*n

In the case where a = -mean(x), b = -mean(y), this simplifies to:

<x-mean(x), y-mean(y)> = <x, y> - sum(x)*sum(y)/n

Using these identities, the correlation between x and y is equivalent to:

(<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)

In the function below, this will be expressed using matrix multiplication and outer products to handle multiple variables simultaneously (as in your example).

Improvement 2: Pre-compute sums and standard deviations

We can pre-compute the sums and standard deviations, to avoid re-computing them for all columns every time the function is called.

Code

Putting the two improvements together, we have the following (I don't speak pandas, so it's in numpy):

def corr_cols(x, xsum, xstd, lo, hi):
    n = x.shape[0]

    return (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)
        / (np.outer(xstd, xstd[lo:hi])*n)
    )

# fake data w/ 10 points, 5 dimensions
x = np.random.rand(10, 5)

# precompute sums and standard deviations along each dimension
xsum = np.sum(x, 0)
xstd = np.std(x, 0)

# calculate columns of correlation matrix for dimensions 1 thru 3
r = corr_cols(x, xsum, xstd, 1, 4)

Better code

Pre-computing and storing the sums and standard deviations can be hidden inside a closure, to give a nicer interface and keep the main code cleaner. Functionally, the operations are equivalent to the previous code.

def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)

    return lambda lo, hi: (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)
        / (np.outer(xstd, xstd[lo:hi])*n)
    )

# construct function to compute columns of correlation matrix
cc = col_correlator(x)

# compute columns of correlation matrix for dimensions 1 thru 3
r = cc(1, 4)

EDIT: (piRSquared)

I wanted to put my edit in this post to further encourage upvoting of this answer.

This is the code I implemented utilizing this advice. This solution translates back and forth between pandas and numpy.

def corr_closure(df):
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])

        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr

Use case:

corr = corr_closure(df)

corr(0, 2)

Upvotes: 6