Reputation: 119
I read the paper k-means++: The Advantages of Careful Seeding and didn't quite understand the algorithm provided which is:
"Let D(x) denote the shortest distance from a data point x to the closest center we have already chosen.
1a. Choose an initial center c1 uniformly at random from X.
1b. Choose the next center ci, selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)
1c. Repeat Step 1b until we have chosen a total of k centers.
2-4. Proceed as with the standard k-means algorithm"
(Better look at the algorithm in the link above)
I especially don't understand step 1b. What do they mean by "selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)"? Do they mean selecting the element that has the largest proportion? And how can performing such a computation lead to selecting the optimal centroids?
Upvotes: 5
Views: 1495
Reputation: 1
The main difference between k-means and k-means++ is the way the initial centers are chosen. k-means selects the initial centers purely at random. Before selecting each new initial center, k-means++ computes the distances between every point and the centers already chosen, so that the selected centers tend to be far away from each other. If one center is chosen near another, the points assigned to those centers need many iterations to separate into their clusters. However, if two centers are far enough apart, the probability that their points belong to the same cluster is very small, and the clusters are easy to separate.
Upvotes: 0
Reputation: 99094
The function D(x) is defined for all points x ∈ X.
In step 1b, we must choose some point to be a new center. We will choose randomly among all points (that are not already centers). But we will not give an equal chance to every point; we will assign different probabilities to different points before we choose. These probabilities must add up to 1.
Consider D(x)^2. We can evaluate this at every point, and add up the values: Sum_of(D(x)^2).
Then we can assign each point x' a probability equal to D(x')^2 / Sum_of(D(x)^2). These probabilities add up to 1, and give a better chance to points far away from all existing centers.
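A minimal sketch of this weighted draw in plain Python (the function and variable names are my own, and I assume 2-D points with squared Euclidean distance):

```python
import random

def choose_next_center(points, centers):
    # D(x)^2: squared distance from each point to its nearest chosen center
    d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
          for px, py in points]
    total = sum(d2)                  # Sum_of(D(x)^2)
    r = random.uniform(0, total)     # land somewhere in the cumulative weights
    acc = 0.0
    for point, weight in zip(points, d2):
        acc += weight
        if r <= acc:
            return point             # chosen with probability weight / total
    return points[-1]                # guard against floating-point round-off
```

Note that a point that already is a center has D(x)^2 = 0, so it gets probability 0 and is never drawn again, and points far from every existing center occupy a larger slice of the cumulative range.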
Upvotes: 5
Reputation: 4408
http://en.wikipedia.org/wiki/K-means%2B%2B
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
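The seeding steps above can be sketched in a few lines of Python; this is a rough illustration (my own function name, 2-D points, squared Euclidean distance), not a production implementation:

```python
import random

def kmeans_pp_init(points, k):
    # Step 1: first center uniformly at random
    centers = [random.choice(points)]
    while len(centers) < k:
        # Step 2: D(x)^2 to the nearest already-chosen center, per point
        d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
              for px, py in points]
        # Step 3: weighted draw, probability proportional to D(x)^2
        centers.append(random.choices(points, weights=d2, k=1)[0])
    return centers  # Step 5: hand these to standard k-means
```

Steps 2 and 3 repeat inside the loop until k centers exist (step 4).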
Upvotes: 0