Masood_mj
Masood_mj

Reputation: 1184

Is Triangle inequality necessary for kmeans?

I wonder if Triangle inequality is necessary for the distance measure used in kmeans.

Upvotes: 4

Views: 3337

Answers (3)

JeeyCi
JeeyCi

Reputation: 597

as so as Dimensions is max number of Lineary independent vectors in a vector space & in Linear Algebra there're 4 main subspaces (Row space, Column space, Null space & Left Null space) - you can calculate distance in tabular data either with Manhattan distance (no triangulation needed) or with Euclidean distance (triangulation needed). but anyway in Cartesian coordinate system (aka 2d [rows-columns]) taking into consideration cos() among vectors from points (when using triangulation - & K-means usually do).

As so as cos(90˚C)=0 (ref "the closer to 0 - the closer the two vectors are to being orthogonal") - thus, yes, triangular inequality is needed. If your points lie in one axis similary [cos=0] (no angle among them) then distance is pure distance among them in one space & 0 in another space due to cos(90˚C)=0. So to do projection matrix of features, you'd better to consider triangular inequality (like K-Means do using Euclidean distance, not manhattan)

p.s. cos(90˚C)=0 leads to Curse of Dimensionality pointed here

P.P.S. be carefull: Euclidean distance (in K-means) gives misleading info about Outliers (k-medians seems better to use, k-medoids seems arguable)

P.P.P.S for classification tasks you can not care about distance [follow link] (care about best cross-validated result), just for clustering care about distance you choose for your algorithm

Upvotes: 0

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77495

k-means is designed for Euclidean distance, which happens to satisfy triangle inequality.

Using other distance functions is risky, as it may stop converging. The reason however is not the triangle inequality, but the mean might not minimize the distance function. (The arithmetic mean minimizes the sum-of-squares, not arbitrary distances!)

There are faster methods for k-means that exploit the triangle inequality to avoid recomputations. But if you stick to classic MacQueen or Lloyd k-means, then you do not need the triangle inequality.

Just be careful with using other distance functions to not run into an infinite loop. You need to prove that the mean minimizes your distances to the cluster centers. If you cannot prove this, it may fail to converge, as the objective function no longer decreases monotonically! So you really should try to prove convergence for your distance function!

Upvotes: 4

Bitwise
Bitwise

Reputation: 7805

Well, classical kmeans is defined on Euclidean space with L2 distance, so you get triangle inequality automatically from that (triangle inequality is part of how a distance/metric is defined). If you are using a non-euclidean metric, you would need to define what is the meaning of the "mean", amongst other things.

If you don't have triangle inequality, it means that two points could be very far from each other, but both can be close to a third point. You need to think how you would like to interpret this case.

Having said all that, I have in the past used average linkage hierarchical clustering with a distance measure that did not fulfill triangle inequality amongst other things, and it worked great for my needs.

Upvotes: 3

Related Questions