Ferdinando Randisi

Reputation: 4388

Computing autocorrelation of vectors with numpy

I'm struggling to come up with a clear, efficient way of using numpy to compute the autocorrelation function of a set of 3D vectors.

I have a set of vectors in 3D space, saved in an array:

import numpy as np

a = np.array([[ 0.24463039,  0.58350592,  0.77438803],
              [ 0.30475903,  0.73007075,  0.61165238],
              [ 0.17605543,  0.70955876,  0.68229821],
              [ 0.32425896,  0.57572195,  0.7506    ],
              [ 0.24341381,  0.50183697,  0.83000565],
              [ 0.38364726,  0.62338687,  0.68132488]])

Their autocorrelation function is defined as

C(t, \{v\}_n) = \frac{1}{n-t} \sum_{i=0}^{n-1-t} \vec v_i \cdot \vec v_{i+t}

I'm struggling to code this up in an efficient, non-obfuscated way. I can compute it explicitly with two nested for loops, but that's slow. NumPy has fast built-in correlation functions, but they seem to use an entirely different definition of the correlation function. A similar problem was solved in How can I use numpy.correlate to do autocorrelation?, but that approach doesn't handle vectors. Do you know how I can solve this?
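For reference, here is the explicit double-loop version I mean (a direct, unoptimized transcription of the formula above; the function name is just for illustration):

import numpy as np

def vector_autocorrelate_naive(a):
    n = len(a)
    # C(t) = 1/(n-t) * sum_{i=0}^{n-1-t} v_i . v_{i+t}
    return np.array([
        sum(np.dot(a[i], a[i + t]) for i in range(n - t)) / (n - t)
        for t in range(n)
    ])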

Upvotes: 2

Views: 4781

Answers (4)

Carlo Valensise

Reputation: 11

Hi, I faced a similar issue. Here is my idea:

import numpy as np

def fast_vector_correlation(M):
    n_row = M.shape[0]
    # dot_mat[i, j] holds the scalar product of row vectors i and j
    dot_mat = M.dot(M.T)
    # the lag-t products lie on the t-th upper diagonal; trace sums them
    corr = np.array([np.trace(dot_mat, offset=t) for t in range(n_row)])
    # divide each lag by the number of contributing pairs
    corr = corr / (n_row - np.arange(n_row))
    return corr

The idea is that dot_mat contains all the scalar products between the row vectors. To compute the correlation at different t values, you just sum the diagonals of the upper-right triangular part.
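For example, applied to the array a from the question:

a = np.array([[0.24463039, 0.58350592, 0.77438803],
              [0.30475903, 0.73007075, 0.61165238],
              [0.17605543, 0.70955876, 0.68229821],
              [0.32425896, 0.57572195, 0.7506    ],
              [0.24341381, 0.50183697, 0.83000565],
              [0.38364726, 0.62338687, 0.68132488]])
# lag 0 is the mean squared norm of the vectors; later lags average fewer pairs
print(fast_vector_correlation(a))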

Upvotes: 1

Pierre de Buyl

Reputation: 7293

The NumPy routines are for 1D arrays. As a "minimal" improvement, use a vectorized operation for the normalisation step (the np.arange in the second-to-last line):

import numpy as np

def vector_autocorrelate(t_array):
  n_vectors = len(t_array)
  # correlate each component independently
  acorr = np.array([np.correlate(t_array[:,i], t_array[:,i], 'full') for i in range(3)])[:, n_vectors-1:]
  # sum the correlations for each component
  acorr = np.sum(acorr, axis = 0)
  # divide by the number of values actually measured and return
  acorr /= (n_vectors - np.arange(n_vectors))
  return acorr

For larger array sizes, you should consider using the Fourier Transform algorithm for correlation. Check out the examples of the library tidynamics if you are interested in that (disclaimer: I wrote the library, it depends only on NumPy).
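If you want to see the idea without the dependency, here is a minimal sketch of the FFT route (my own illustration using numpy.fft, not the actual tidynamics implementation; it zero-pads to turn the FFT's circular correlation into a linear one):

import numpy as np

def fft_vector_autocorrelate(data):
    n = len(data)
    # zero-pad to 2n so the circular correlation becomes linear
    f = np.fft.fft(data, n=2 * n, axis=0)
    # inverse transform of the power spectrum: per-component autocorrelation
    acorr = np.fft.ifft(f * f.conj(), axis=0).real[:n]
    # sum components (the dot product) and divide by the pair count per lag
    return acorr.sum(axis=1) / (n - np.arange(n))

This runs in O(n log n) instead of O(n**2), which is where the large-size advantage in the timings below comes from.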

For reference, here are the timings for NumPy code (that I wrote for testing), your vector_autocorrelate, and tidynamics.

import time

import numpy as np
import tidynamics

size = [2**i for i in range(4, 17, 2)]

np_time = []
ti_time = []
va_time = []
for s in size:
    data = np.random.random(size=(s, 3))
    t0 = time.time()
    correlate = np.array([np.correlate(data[:,i], data[:,i], mode='full') for i in range(data.shape[1])])[:,s-1:]  # keep the non-negative lags
    correlate = np.sum(correlate, axis=0)/(s-np.arange(s))
    np_time.append(time.time()-t0)
    t0 = time.time()
    correlate = tidynamics.acf(data)
    ti_time.append(time.time()-t0)
    t0 = time.time()
    correlate = vector_autocorrelate(data)
    va_time.append(time.time()-t0)

You can see the results:

print("size", size)
print("np_time", np_time)
print("va_time", va_time)
print("ti_time", ti_time)

size [16, 64, 256, 1024, 4096, 16384, 65536]

np_time [0.00023794174194335938, 0.0002703666687011719, 0.0002713203430175781, 0.001544952392578125, 0.0278470516204834, 0.36094141006469727, 6.922360420227051]

va_time [0.00021696090698242188, 0.0001690387725830078, 0.000339508056640625, 0.0014629364013671875, 0.024930953979492188, 0.34442687034606934, 7.005630731582642]

ti_time [0.0011148452758789062, 0.0008449554443359375, 0.0007512569427490234, 0.0010488033294677734, 0.0026645660400390625, 0.007939338684082031, 0.048232316970825195]

Or plot them:

import matplotlib.pyplot as plt

plt.plot(size, np_time)
plt.plot(size, va_time)
plt.plot(size, ti_time)
plt.loglog()

For anything but very small data series, the "N**2" algorithm is unusable.

Upvotes: 2

jboi

Reputation: 11892

Here's my result. It is slower (about 4x) and delivers different results. Why do I post it anyway? I thought it was worth seeing how to measure, and what the difference is. If, in addition, anybody finds the reason for the different results, I'd be more than happy.

So, here's my solution:

%timeit [np.mean([np.dot(a[t], a[i+t]) for i in range(len(a)-t)]) for t in range(len(a))]

Result: 95.2 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In comparison, your solution is about 4x faster:

%timeit vector_autocorrelate(a)

delivers: 24.8 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
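On closer inspection, the different results seem to come from the inner dot product: it uses a[t] where the formula calls for a[i]. A corrected (not re-timed) version would be:

[np.mean([np.dot(a[i], a[i+t]) for i in range(len(a)-t)]) for t in range(len(a))]

With that change, the result should agree with vector_autocorrelate.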

Upvotes: 1

Ferdinando Randisi

Reputation: 4388

I'm posting the answer here in case someone else needs it, since it took me quite some time to figure out a viable approach. I ended up solving this by defining the following function:

import numpy as np

def vector_autocorrelate(t_array):
  n_vectors = len(t_array)
  # correlate each component independently
  acorr = np.array([np.correlate(t_array[:,i], t_array[:,i], 'full') for i in range(3)])[:, n_vectors-1:]
  # sum the correlations for each component
  acorr = np.sum(acorr, axis = 0)
  # divide by the number of values actually measured and return
  acorr = np.array([val / (n_vectors - i) for i, val in enumerate(acorr)])
  return acorr

If anybody has a better idea, I would really like to hear it, since I think the current version is still not as compact as it could be. It's better than nothing, though, which is why I'm posting it here.
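For what it's worth, here is a slightly more compact sketch of the same algorithm (my variant, with the component count read from the array instead of hardcoded to 3):

import numpy as np

def vector_autocorrelate_compact(t_array):
    n, d = t_array.shape
    # per-component autocorrelations at non-negative lags, summed into dot products
    acorr = sum(np.correlate(t_array[:, i], t_array[:, i], 'full')[n - 1:] for i in range(d))
    # divide by the number of pairs contributing to each lag
    return acorr / (n - np.arange(n))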

Upvotes: 1
