Travis
Travis

Reputation: 659

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass (well, m/z but that not so important) values and a time. I extract the data from the file, however, it is possible to get repeat measurements and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related based on either similarity in mass alone, or similarity in mass and time.

An example of data that should be group together is:

m/z time

337.65 1524.6

337.65 1524.6

337.65 1604.3

However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms sadly.

Upvotes: 3

Views: 8140

Answers (3)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77454

Why don't you just set a threshold?

If successive values (by time) do not differ by at least +-0.1 (by m/s) they a grouped together. Alternatively, use a relative threshold: differ by less than +- .1%. Set these thresholds according to your domain knowledge.

That sounds like the straightforward way of preprocessing this data to me.

Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.

Upvotes: 1

hackartist
hackartist

Reputation: 5264

http://en.wikipedia.org/wiki/Cluster_analysis

http://en.wikipedia.org/wiki/DBSCAN

Read the section about hierarchical clustering and also look into DBSCAN if you really don't want to specify how many clusters in advance. You will need to define a distance metric and in that step is where you would determine which of the features or combination of features you will be clustering on.

Upvotes: 2

ElKamina
ElKamina

Reputation: 7807

For the simple one dimension K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting appropriate K. The best way to select a good K is to either plot K vs residual variance and select the K that "dramatically" reduces variance. Another strategy is to use some information criteria (eg. Bayesian Information Criteria).

You can extend K-Means to multi-dimensional data easily. But you should be beware of scaling the individual dimensions. Eg. Among items (1KG, 1KM) (2KG, 2KM) the nearest point to (1.7KG, 1.4KM) is (2KG, 2KM) with these scales. But once you start expression second item in meters, probably the alternative is true.

Upvotes: 0

Related Questions