Reputation: 3441

Input to the clustering algorithms

I have 250 time series that I want to cluster to see which ones behave more or less the same way. Despite searching Google and Stack Overflow, I couldn't find an example that tells me whether I have to merge all of my time series into one structure or whether I can keep them in separate variables. Any explanation of the expected input would help.

I am programming in Python 3.6 and use the scikit-learn library for clustering.
Each of my time series is a pandas DataFrame with one column.

Upvotes: 0

Answers (1)

user6655984

The input format of scikit-learn's clustering methods varies by method. Click a method's name in the list of classes and scroll down to the description of the class's fit method; this is the one that does the clustering. For most methods, e.g. K-means, the data must be a 2D array of shape (n_samples, n_features). In your case, the number of samples is 250 and the number of features is the length of the time series (so they would all have to be exactly the same length).
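As a rough sketch of what that looks like for single-column DataFrames: the snippet below stacks them into one (n_samples, n_features) array and passes it to KMeans. The names series_list and X, the random data, the length of 100 and n_clusters=4 are all assumptions made up for illustration, not anything from your setup.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the 250 series: a list of single-column
# DataFrames, all of the same length (100 points here).
rng = np.random.default_rng(0)
series_list = [pd.DataFrame({"value": rng.normal(size=100)}) for _ in range(250)]

# Take the single column of each DataFrame as a 1D array and stack the
# arrays as rows: one row per series, one column per time step.
X = np.vstack([df.iloc[:, 0].to_numpy() for df in series_list])
print(X.shape)  # (250, 100)

# n_clusters=4 is an arbitrary choice for the sake of the example.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels.shape)  # (250,)
```

So the series can live in separate DataFrames up to the point of fitting, but they have to be combined into a single array for the fit call.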

But I'd be wary of using a 2D array as input, because all the values will be treated as separate, unordered features, so the time dimension is lost. If one series is just a shifted copy of another, the two may be treated as completely different.
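To make the shift problem concrete, here is a small made-up example (the sine wave, its period and the shift amount are my own choices). With plain Euclidean distance on the raw values, a shifted copy of a series ends up farther from the original than a constant zero series does.

```python
import numpy as np

t = np.arange(200)
base = np.sin(2 * np.pi * t / 50)   # sine wave with a period of 50 samples
shifted = np.roll(base, 25)         # same wave, shifted by half a period
flat = np.zeros_like(base)          # a series with no oscillation at all

# Euclidean distance treats each time step as an independent feature.
d_shift = np.linalg.norm(base - shifted)
d_flat = np.linalg.norm(base - flat)

print(d_shift)  # about 20.0
print(d_flat)   # about 10.0
```

A feature-based method such as K-means would therefore consider the flat line a closer match to the sine wave than the sine wave's own shifted copy.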

Some scikit-learn clustering methods let you precompute a 250-by-250 distance matrix (measuring how different two series are) or an affinity/similarity matrix (measuring how similar they are) and pass that in instead of the actual data. The matrix can be computed in a double loop; 250 by 250 is not too bad. The methods that can take a square matrix instead of the original data are those whose documentation lists a "precomputed" option for the affinity or metric parameter.
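A minimal sketch of the precomputed route, using DBSCAN with metric="precomputed" as one method that accepts a square distance matrix. The series_distance function is a hypothetical placeholder (plain Euclidean distance) that you would swap for a measure suited to time series; the synthetic data and the eps and min_samples values are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def series_distance(a, b):
    """Hypothetical placeholder: plain Euclidean distance between two series."""
    return float(np.linalg.norm(a - b))

# Synthetic stand-in for the 250 series: 125 noisy sine waves and
# 125 noisy ramps, each 50 points long.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
series = [np.sin(2 * np.pi * t) + rng.normal(scale=0.05, size=t.size)
          for _ in range(125)]
series += [2 * t + 3 + rng.normal(scale=0.05, size=t.size)
           for _ in range(125)]

# Fill the symmetric 250-by-250 distance matrix with a double loop.
n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = series_distance(series[i], series[j])

# Pass the matrix in place of the raw data.
labels = DBSCAN(eps=1.0, min_samples=5, metric="precomputed").fit_predict(D)
print(len(set(labels)))  # number of distinct labels found
```

With this approach the series never have to be merged into one feature array; only the pairwise distances are handed to the clustering method.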

I suggest doing some research on time-series similarity measures (to be used for computing that square matrix) before proceeding to clustering.
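As one example of such a measure (my own pick for illustration, not the only option), dynamic time warping (DTW) allows points to be matched across different time steps, so a shifted shape can still score as similar. The function below is a plain, unoptimized sketch; the name dtw_distance and the toy series are made up.

```python
import numpy as np

def dtw_distance(a, b):
    """Basic dynamic time warping distance with absolute-difference cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # repeat b[j-1]
                                 cost[i, j - 1],      # repeat a[i-1]
                                 cost[i - 1, j - 1])  # advance both
    return float(cost[n, m])

# Two toy series with the same bump at different positions.
a = [0, 0, 1, 2, 1, 0, 0, 0]
b = [0, 0, 0, 0, 1, 2, 1, 0]

print(dtw_distance(a, b))                          # 0.0
print(np.linalg.norm(np.array(a) - np.array(b)))   # greater than 0
```

This pure-Python version takes time proportional to the product of the two series lengths for every pair, so for long series a dedicated library implementation would likely be a better fit.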

Upvotes: 1
