Reputation: 1356
All k-means algorithms try, in one way or another, to find k points such that, if you map each point of the original data set to its closest one of those k points, the sum of squared distances to those points is minimized.
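In symbols (writing x_1, ..., x_n for the data points and c_1, ..., c_k for the k centers, just to fix notation), the objective is to choose c_1, ..., c_k minimizing

sum_{i=1..n} min_{j=1..k} ||x_i - c_j||^2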
The problem with this cost function:
Imagine the following 1-dimensional case (i.e., numbers instead of vectors) with k = 2. Call the ground-truth points A and B, with A = -1 and B = 1, and call the points an optimal k-means algorithm would return C and D, where C and D correspond to A and B respectively. Now say we have a large dataset created from points around A and B with some normal distribution. Assuming the variance is large enough, we expect a percentage of the points generated around A to be positive and a percentage of the points generated around B to be negative. Those points therefore get mapped to the wrong center, which makes C and D closer to each other than A and B are, and as the variance increases C and D both approach 0.
Solution to this problem?
This problem seems so fundamental to me that I was sure I could find something about it, yet when I searched I couldn't find anything. So my question is: is there any paper/algorithm that addresses this problem and tries to solve it? Even for special cases, e.g. assuming the data are normally distributed, or under other assumptions about the distribution of the data? It just seems strange to me that I couldn't find any mention of this problem whatsoever.
Edit:
This actually causes the clusters to repel each other, not attract. This is because (if we assume equal variance and the same number of points in each cluster) for every point from cluster 1 that is misclassified there is a point from cluster 2 that is also misclassified, but those points are closer to the other cluster than the original points, so the cluster means become farther apart.
So if we want to fix it, we could solve k-means and then somehow move the cluster centers closer to each other based on their distances and variances.
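Here is a minimal simulation of the setup above (a sketch, assuming numpy and scikit-learn are available; the sigma values and sample size are arbitrary): two Gaussian clusters with true means A = -1 and B = +1, fitted with k-means for increasing variance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_per_cluster = 50_000

for sigma in (0.2, 1.0, 2.0, 5.0):
    a = rng.normal(-1.0, sigma, n_per_cluster)   # points generated around A = -1
    b = rng.normal(+1.0, sigma, n_per_cluster)   # points generated around B = +1
    X = np.concatenate([a, b]).reshape(-1, 1)    # 1-dimensional data as a column

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    c, d = sorted(km.cluster_centers_.ravel())   # fitted centers C <= D
    print(f"sigma = {sigma:4.1f}: C = {c:+.2f}, D = {d:+.2f}")
```

For small sigma the fitted centers sit close to -1 and +1; as sigma grows they drift farther apart than the true means, which is the repelling effect described in the edit.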
Upvotes: 2
Views: 256
Reputation:
This problem is a universal phenomenon related to the so-called Bayes classification error, which is the lowest error that you can achieve when the distributions of the classes overlap. (In practice, the classifiers achieve worse than this theoretical bound.)
It means that, whenever the distributions overlap, some percentage of misclassification is impossible to avoid, whatever the method. There are big cats and small tigers.
In fact, this is a very common, if not ubiquitous, situation. The only way to improve on it is to use additional features for classification, i.e. to increase the dimension of the data space, such as telling the color in addition to the size.
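For the 1D example in the question (two equal-weight Gaussian classes centered at -1 and +1, both with standard deviation sigma), this floor even has a closed form: the optimal rule assigns a point to the nearer mean, so the Bayes error is Phi(-1/sigma). A quick check (a sketch, assuming scipy is available; the sigma values are just illustrative):

```python
from scipy.stats import norm

# Two equal-weight 1D Gaussians at -1 and +1 with common std sigma:
# the optimal decision boundary is 0, so the error rate is
# P(N(+1, sigma^2) < 0) = Phi(-1/sigma).
for sigma in (0.5, 1.0, 2.0, 5.0):
    print(f"sigma = {sigma}: Bayes error = {norm.cdf(-1.0 / sigma):.3f}")
```

No classifier that sees only this one feature can do better than these rates; adding an informative second feature (the color, in the example above) is what can push the error lower.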
Upvotes: 2
Reputation: 12151
In most contexts, k-means is defined as an optimization problem that minimizes the total within-cluster variance, i.e. the sum of squared distances from each point to the center of its cluster.
In the case of infinite variance of the data, what you are running into is the non-uniqueness of the solution to the k-means problem. As the true variance of both underlying distributions approaches infinity, the set of optimal solutions to the k-means problem formulation becomes infinite. That is, any set of means returns the exact same quality of fit (theoretically). In practice, the k-means algorithm will simply over-fit the (finite) noise in your data and select an arbitrary pair of means.
Simply put, the k-means problem definition is not appropriate in the infinite variance case you describe. Note that in the case of infinite variance, you have much bigger problems than k-means not yielding a useful solution. Other fundamental properties (such as the Central Limit Theorem) are no longer guaranteed.
In the case of finite but large variance, we expect the k-means problem to be ill-conditioned. To see why, consider the Central Limit Theorem (CLT) in the context of k-means.
The CLT states that the means computed during k-means converge to a normal distribution with mean equal to the true mean (of the data) and variance equal to sigma^2 / n (i.e. standard deviation sigma / sqrt(n)), where sigma^2 is the variance of your data and n is the number of samples. As sigma grows, n must grow quadratically (in sigma) in order to have a reasonable chance of estimating the true mean accurately.
Simply put, the solution to your problem is to do the appropriate exploratory analysis of your data to determine how high your variance is, and if the number of samples you have is sufficient for the expected variance. If not, go back and collect more data, or apply a different technique.
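As a rough illustration of that check (a sketch, assuming numpy; the target precision of 0.1 is an arbitrary choice): by the CLT scaling above, the standard error of an estimated cluster mean is about sigma / sqrt(n), so the number of samples needed per cluster grows quadratically in sigma.

```python
import numpy as np

target_se = 0.1  # desired standard error of each estimated cluster mean (arbitrary)

for sigma in (1.0, 2.0, 5.0, 10.0):
    # sigma / sqrt(n) <= target_se  =>  n >= (sigma / target_se)^2
    n_needed = int(np.ceil((sigma / target_se) ** 2))
    print(f"sigma = {sigma:5.1f}: need n >= {n_needed} samples per cluster")
```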
Upvotes: 2