Reputation: 4618
I have 6 timeseries data as follows namely t1, t2, t3, t4, t5 and t6.
import numpy as np
series = np.array([
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1],
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1]])
I want to create a euclidean distance matrix from these 6 timeseries as in the format of (i.e. 6*6 where x
denotes the corresponding euclidean distance):
t1 t2 t3 t4 t5 t6
t1 0 x x x x x
t2 x 0 x x x x
t3 x x 0 x x x
t4 x x x 0 x x
t5 x x x x 0 x
t6 x x x x x 0
I am currently constructing this matrix manually as follows (In this SO question: Efficient and precise calculation of the euclidean distance this method has got the hightest performance).
e.g., to calculate euclidean distance between t3 and t6.
def eudis(v1, v2):
dist = [(a - b)**2 for a, b in zip(v1, v2)]
dist = math.sqrt(sum(dist))
return dist
eudis(t3, t6)
However, I am sure that there could be more easy and computationally efficient way to do this in python. Please let me know if you have suggestions.
I am happy to provide more details if needed.
Upvotes: 2
Views: 5014
Reputation: 30609
You can also use pdist
to get the distance matrix:
from scipy.spatial.distance import pdist, squareform
squareform(pdist(series))
So for relatively small datasets (up to about 20 series with 200 elements each) pdist
is fastest, for larger datasets euclidean_disances
performs much better. pure numpy
is mostly slower and may fail to allocate the intermediate array for large datasets.
Tested with np.random.randint(0, 100, (n, 10*n)).astype('int16')
, numpy 1.17.4, scipy 1.4.1, sklearn 0.23.1, python 3.8.2, Win10 64bit.
Upvotes: 3
Reputation: 4181
You can create a distance matrix in simple numpy in one line, you don't need anything else.
np.sqrt(((series[:,None,:] - series)**2).sum(axis=2))
Upvotes: 2
Reputation: 88276
You don't need to loop at all, for the euclidean distance between two arrays just compute the elementwise squares of the differences as:
def euclidean_distance(v1, v2):
return np.sqrt(np.sum((v1 - v2)**2))
And for the distance matrix, you have sklearn.metrics.pairwise.euclidean_distances
:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(a).round(2)
array([[0. , 2.83, 3.74, 0. , 2.83, 3.74],
[2.83, 0. , 2.83, 2.83, 0. , 2.83],
[3.74, 2.83, 0. , 3.74, 2.83, 0. ],
[0. , 2.83, 3.74, 0. , 2.83, 3.74],
[2.83, 0. , 2.83, 2.83, 0. , 2.83],
[3.74, 2.83, 0. , 3.74, 2.83, 0. ]])
np.allclose(
eudis(series[2], series[3]),
euclidean_distance(series[2], series[3])
)
# True
Upvotes: 2