Reputation: 1369
In order to clusterize a set of time series I'm looking for a smart distance metric. I've tried some well known metric but no one fits to my case.
ex: Let's assume that my cluster algorithm extracts this three centroids [s1, s2, s3]:
I want to put this new example [sx] in the most similar cluster:
The most similar centroids is the second one, so I need to find a distance function d that gives me d(sx, s2) < d(sx, s1)
and d(sx, s2) < d(sx, s3)
edit
Here the results with metrics [cosine, euclidean, minkowski, dynamic type warping]
]3
edit 2
User Pietro P suggested to apply the distances on the cumulated version of the time series
The solution works, here the plots and the metrics:
Upvotes: 11
Views: 8877
Reputation: 11
Pietro P's answer is just a special case of applying a convolution to your time series.
If I gave the kernel:
[1,1,...,1,1,1,0,0,0,0,...0,0]
I would get a cumulative series .
Adding a convolution works because you're giving each data point information about it's neighbours - it's now order dependent.
It might be interesting to try with a guassian convolution or other kernels.
Upvotes: 1
Reputation: 1238
Another approach would be by utilizing DTW which is an algorithm to compute the similarity between two temporal sequences. Full disclosure; I coded a Python package for this purpose called trendypy
, you can download via pip (pip install trendypy
). Here is a demo on how to utilize the package. You're just just basically computing the total min distance for different combinations to set the cluster centers.
Upvotes: 2
Reputation: 1473
nice question! using any standard distance of R^n (euclidean, manhattan or generically minkowski) over those time series cannot achieve the result you want, since those metrics are independent of the permutations of the coordinate of R^n (while time is strictly ordered and it is the phenomenon you want to capture).
A simple trick, that can do what you ask is using the cumulated version of the time series (sum values over time as time increases) and then apply a standard metric. Using the Manhattan metric, you would get as a distance between two time series the area between their cumulated versions.
Upvotes: 11
Reputation: 747
what about using standard Pearson correlation coefficient? then you can assign the new point to the cluster with the highest coefficient.
correlation = scipy.stats.pearsonr(<new time series>, <centroid>)
Upvotes: 1