Reputation: 4618

How to efficiently calculate euclidean distance matrix for several timeseries

I have 6 timeseries data as follows namely t1, t2, t3, t4, t5 and t6.

import numpy as np
series = np.array([
     [0., 0, 1, 2, 1, 0, 1, 0, 0],
     [0., 1, 2, 0, 0, 0, 0, 0, 0],
     [1., 2, 0, 0, 0, 0, 0, 1, 1],
     [0., 0, 1, 2, 1, 0, 1, 0, 0],
     [0., 1, 2, 0, 0, 0, 0, 0, 0],
     [1., 2, 0, 0, 0, 0, 0, 1, 1]])

I want to create a euclidean distance matrix from these 6 timeseries as in the format of (i.e. 6*6 where x denotes the corresponding euclidean distance):

     t1  t2  t3  t4  t5  t6
t1    0   x   x   x   x   x
t2    x   0   x   x   x   x
t3    x   x   0   x   x   x
t4    x   x   x   0   x   x
t5    x   x   x   x   0   x
t6    x   x   x   x   x   0

I am currently constructing this matrix manually as follows (In this SO question: Efficient and precise calculation of the euclidean distance this method has got the hightest performance).

e.g., to calculate euclidean distance between t3 and t6.

def eudis(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist

eudis(t3, t6)

However, I am sure that there could be more easy and computationally efficient way to do this in python. Please let me know if you have suggestions.

I am happy to provide more details if needed.

Upvotes: 2

Answers (3)

Stef

Reputation: 30609

You can also use pdist to get the distance matrix:

from scipy.spatial.distance import pdist, squareform
squareform(pdist(series))

Performance comparison with pure numpy and euclidean_distances solutions:

So for relatively small datasets (up to about 20 series with 200 elements each) pdist is fastest, for larger datasets euclidean_disances performs much better. pure numpy is mostly slower and may fail to allocate the intermediate array for large datasets.
Tested with np.random.randint(0, 100, (n, 10*n)).astype('int16'), numpy 1.17.4, scipy 1.4.1, sklearn 0.23.1, python 3.8.2, Win10 64bit.

Upvotes: 3

Mercury

Reputation: 4181

You can create a distance matrix in simple numpy in one line, you don't need anything else.

np.sqrt(((series[:,None,:] - series)**2).sum(axis=2))

Upvotes: 2

yatu

Reputation: 88276

You don't need to loop at all, for the euclidean distance between two arrays just compute the elementwise squares of the differences as:

def euclidean_distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2)**2))

And for the distance matrix, you have sklearn.metrics.pairwise.euclidean_distances:

from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances(a).round(2)

array([[0.  , 2.83, 3.74, 0.  , 2.83, 3.74],
       [2.83, 0.  , 2.83, 2.83, 0.  , 2.83],
       [3.74, 2.83, 0.  , 3.74, 2.83, 0.  ],
       [0.  , 2.83, 3.74, 0.  , 2.83, 3.74],
       [2.83, 0.  , 2.83, 2.83, 0.  , 2.83],
       [3.74, 2.83, 0.  , 3.74, 2.83, 0.  ]])

np.allclose(
    eudis(series[2], series[3]),
    euclidean_distance(series[2], series[3])
)
# True

Upvotes: 2

How to efficiently calculate euclidean distance matrix for several timeseries

Answers (3)

Related Questions