Reputation: 10010
Let's say I have a set of vectors (readings from sensor 1, readings from sensor 2, readings from sensor 3 -- indexed first by timestamp and then by sensor id) that I'd like to correlate with a separate set of vectors (temperature, humidity, etc. -- also indexed first by timestamp and then by type).
What is the cleanest way in numpy to do this? It seems like it should be a rather simple function...
In other words, I'd like to see:
> a.shape
(365,20)
> b.shape
(365, 5)
> correlations = magic_correlation_function(a,b)
> correlations.shape
(20, 5)
Cheers, /YGA
P.S. I've been asked to add an example.
Here's what I would like to see:
In [27]: x
Out[27]:
array([[ 0. ,  0. ,  0. ],
       [-1. ,  0. , -1. ],
       [-2. ,  0. , -2. ],
       [-3. ,  0. , -3. ],
       [-4. ,  0.1, -4. ]])

In [28]: y
Out[28]:
array([[0. , 0. ],
       [1. , 0. ],
       [2. , 0. ],
       [3. , 0. ],
       [4. , 0.1]])

In [29]: magical_correlation_function(x, y)
Out[29]:
array([[-1.        ,  0.70710678,  1.        ],
       [-0.70710678,  1.        ,  0.70710678]])
P.S. 2: Whoops, I mis-transcribed my example. Sorry all. Fixed now.
Upvotes: 4
Views: 1803
Reputation: 111866
Will this do what you want?
correlations = numpy.dot(a.T, b)
Note: if you do this, you'll probably want to standardize or whiten each column of a and b first, e.g. something equivalent to this:
a = (a - a.mean(axis=0)) / numpy.sqrt(a.var(axis=0))
b = (b - b.mean(axis=0)) / numpy.sqrt(b.var(axis=0))
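Putting the two steps together, here is a minimal sketch of the magic_correlation_function the question asks for. The function body and the random test data are my illustration, not code from this answer; it assumes plain Pearson correlation between columns.

import numpy

def magic_correlation_function(a, b):
    # Standardize each column to zero mean and unit standard deviation.
    a_std = (a - a.mean(axis=0)) / a.std(axis=0)
    b_std = (b - b.mean(axis=0)) / b.std(axis=0)
    # Pearson correlation of every column of a with every column of b.
    return numpy.dot(a_std.T, b_std) / a.shape[0]

a = numpy.random.randn(365, 20)
b = numpy.random.randn(365, 5)
print(magic_correlation_function(a, b).shape)   # prints (20, 5)

Dividing by a.shape[0] matches the population (ddof=0) standard deviations used above, so each entry lands in [-1, 1].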
Upvotes: 1
Reputation: 8128
The simplest thing I could find was to use the scipy.stats package:
In [8]: x
Out[8]:
array([[ 0. ,  0. ,  0. ],
       [-1. ,  0. , -1. ],
       [-2. ,  0. , -2. ],
       [-3. ,  0. , -3. ],
       [-4. ,  0.1, -4. ]])

In [9]: y
Out[9]:
array([[0. , 0. ],
       [1. , 0. ],
       [2. , 0. ],
       [3. , 0. ],
       [4. , 0.1]])

In [10]: import numpy, scipy.stats

In [27]: (scipy.stats.cov(y, x)
          / numpy.sqrt(scipy.stats.var(y, axis=0)[:, numpy.newaxis])
          / numpy.sqrt(scipy.stats.var(x, axis=0)))
Out[27]:
array([[-1.        ,  0.70710678, -1.        ],
       [-0.70710678,  1.        , -0.70710678]])
These aren't the numbers you got, but I think you've mixed up your rows. (Element [0,0] should be 1.)
A more complicated but purely numpy solution is
In [40]: numpy.corrcoef(x.T, y.T)[numpy.arange(x.shape[1])[numpy.newaxis, :],
                                  numpy.arange(y.shape[1])[:, numpy.newaxis] + x.shape[1]]
Out[40]:
array([[-1.        ,  0.70710678, -1.        ],
       [-0.70710678,  1.        , -0.70710678]])
This will be slower because it also computes the correlation of each column in x with every other column in x, which you don't want. Also, the advanced indexing used to pull out the subset of the array you want can make your head hurt.
If you're going to use numpy intensively, get familiar with its rules for broadcasting and indexing. They will help you push as much work as possible down to the C level.
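For what it's worth, here is a sketch of the same corrcoef idea without the advanced indexing (my addition, not part of the answer above): numpy.corrcoef takes a rowvar argument, so you can treat columns as variables and slice out the cross block with basic slicing.

import numpy

x = numpy.array([[ 0. ,  0. ,  0. ],
                 [-1. ,  0. , -1. ],
                 [-2. ,  0. , -2. ],
                 [-3. ,  0. , -3. ],
                 [-4. ,  0.1, -4. ]])
y = numpy.array([[0. , 0. ],
                 [1. , 0. ],
                 [2. , 0. ],
                 [3. , 0. ],
                 [4. , 0.1]])

# With rowvar=False every column is a variable, so the combined matrix
# lists x's columns first and y's columns after them.
full = numpy.corrcoef(x, y, rowvar=False)   # shape (5, 5)
cross = full[:x.shape[1], x.shape[1]:]      # shape (3, 2); entry [i, j] = corr(x[:, i], y[:, j])
print(cross)

It still computes the within-x and within-y blocks, so the performance caveat above still applies; it only avoids the fancy indexing.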
Upvotes: 2
Reputation: 3374
As David said, you should define the correlation you're using. I don't know of any definition of correlation that gives sensible numbers when correlating empty and non-empty signals.
Upvotes: -1