Reputation: 414
I'm trying to cluster time-series data: I have about 16,000 time series, each about 1,500 samples long.
I tried using the dtw package:
library(dtw)  # makes the "DTW" method available to dist() via the proxy package
d <- dist(x = time_series, method = "DTW")
hc <- hclust(d)
however, the distance matrix calculation did not finish over an entire weekend.
I'm looking for a faster approach, since my data set will be much larger.
Upvotes: 0
Views: 746
Reputation: 254
Your data is of length 1,500. Suppose it is oversampled.
DTW's cost is quadratic in the series length, so if you downsample 1 in 2, DTW will be about 4 times faster; 1 in 4, about 16 times faster; 1 in 10, about 100 times faster.
This might be a good starting point (a sketch follows below).
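A minimal sketch of the 1-in-4 case, assuming your series are the rows of a numeric matrix called time_series (the name taken from your code):

library(dtw)                                            # for the "DTW" dist method
ds <- time_series[, seq(1, ncol(time_series), by = 4)]  # keep every 4th sample
d  <- dist(ds, method = "DTW")                          # ~16x faster than full length
hc <- hclust(d)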
Are you using cDTW (constrained DTW) or unconstrained DTW? The former is significantly faster, and can often be more accurate.
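For example, the dtw package lets you apply a Sakoe-Chiba band per pair; series1, series2, and the window size of 50 below are illustrative placeholders, not recommendations:

library(dtw)
# cDTW: the warping path may stray at most 50 samples from the diagonal
a <- dtw(series1, series2,
         window.type = "sakoechiba", window.size = 50,
         distance.only = TRUE)   # skip backtracking; we only need the distance
a$distance

The same window arguments should pass through dist(..., method = "DTW") as well, since proxy forwards extra arguments to the registered method.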
A paper at SIGKDD this week presents a faster way to cluster under DTW by using upper and lower bounds [a].
However, your distance matrix has (16000 * 15999)/2 = 127,992,000 entries.
So if you have two days: 172,800 seconds / 127,992,000 comparisons ≈ 1,350 microseconds each.
So you need to do each comparison in about 1.35 milliseconds; that is not a lot of time. This will be difficult, but it is doable with effort. If you get stuck, email me (I am the last author of [a]).
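The budget, worked out in R:

pairs  <- 16000 * 15999 / 2      # 127,992,000 pairwise distances
budget <- 2 * 24 * 3600 / pairs  # seconds available per comparison over two days
budget * 1e6                     # ~1350 microseconds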
[a] Nurjahan Begum, Liudmila Ulanova, Jun Wang, Eamonn Keogh (2015). Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy. SIGKDD 2015.
Upvotes: 3