Long Phan
Long Phan

Reputation: 119

Single Pass Seed Selection Algorithm for k-Means

I've recently read the Single Pass Seed Selection Algorithm for k-Means article, but not really understand the algorithm, which is:

  1. Calculate distance matrix Dist in which Dist (i,j) represents distance from i to j
  2. Find Sumv in which Sumv (i) is the sum of the distances from ith point to all other points.
  3. Find the point i which is min (Sumv) and set Index = i
  4. Add First to C as the first centroid
  5. For each point xi, set D (xi) to be the distance between xi and the nearest point in C
  6. Find y as the sum of distances of first n/k nearest points from the Index
  7. Find the unique integer i so that D(x1)^2+D(x2)^2+...+D(xi)^2 >= y > D(x1)^2+D(x2)^2+...+D(x(i-1))^2
  8. Add xi to C
  9. Repeat steps 5-8 until k centers

Especially step 6, do we still use the same Index (same point) over and over or we use the newly added point from C? And about step 8, does i have to be larger than 1?

Upvotes: 2

Views: 556

Answers (2)

user3300314
user3300314

Reputation: 1

Single Pass Seed Selection algorithm is a novel algorithm. Single Pass mean that without any iterations first seed can be selected. k-means++ performance is depends on first seed. It is overcome in SPSS. Please gothrough the paper "Robust Seed Selestion Algorithm for k-means" from the same authors

John J. Louis

Upvotes: -1

Raff.Edward
Raff.Edward

Reputation: 6514

Honestly, I wouldn't worry about understanding that paper - its not very good.

  • The algorithm is poorly described.
  • Its not actually a single pass, it needs do to n^2/2 pairwise computations + one additional pass through the data.
  • They don't report the runtime of their seed selection scheme, probably because it is very bad doing O(n^2) work.
  • They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
  • One of their metrics of "better"ness is how many iterations it takes k-means to run given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could be more iterations, but less work done per iteration), and they don't report the run time or which k-means algorithm they use.

You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.

If you really want to understand what they are doing, I would brush up on your matlab and read their provided matlab code. But its not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).

Upvotes: 4

Related Questions