ileadall42

Reputation: 651

The K-Means++ Algorithm - Explain the Choice of the Next Cluster Center


As in the picture, why not simply choose point 2 as the second cluster center? Why generate a random number in [0, 1] instead?

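import numpy as np

def initialize(X, K):  # k-means++ seeding
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    C = np.zeros((K, n))
    # First center: one data point chosen uniformly at random.
    C[0, :] = X[np.random.randint(0, m)]
    for k in range(1, K):
        # Squared distance from each point to its nearest already-chosen center.
        D2 = np.array([min(np.inner(c - x, c - x) for c in C[:k]) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        i = m - 1  # fallback if rounding leaves cumprobs[-1] slightly below 1
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C[k, :] = X[i]
    return C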

Why generate r and compare it with p (the cumulative probability, labelled Sum in the picture)?

Upvotes: 0

Views: 243

Answers (1)

Royi

Reputation: 4953

Because the behavior of the algorithm is easier to analyze when the selection is probability-driven.

Intuitively, you don't want to always choose the farthest point, as it might be an outlier.
You want to choose a point that is probably part of a mass of points that is fairly far away.
Sampling from a PDF proportional to the squared distance works well for that purpose.
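
To make this concrete, here is a minimal sketch (not the asker's code) of what comparing r with the cumulative sums achieves: it draws index j with probability proportional to its squared distance. The helper name `sample_index` and the toy 1-D data are made up for illustration. The first center sits at 0, a dense cluster lies around 10, and a single outlier sits at 14.

```python
import numpy as np

def sample_index(weights, rng):
    """Draw one index with probability proportional to its weight."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    cumprobs = probs.cumsum()
    r = rng.random()  # uniform in [0, 1)
    # First index whose cumulative probability exceeds r
    # (the same test as `if r < p` in the question's loop).
    idx = int(np.searchsorted(cumprobs, r, side="right"))
    return min(idx, len(probs) - 1)  # guard against rounding at the top end

# Toy data, assumed for illustration only.
points = np.concatenate(([0.0], np.linspace(9.0, 11.0, 50), [14.0]))
first_center = points[0]
D2 = (points - first_center) ** 2  # squared distance to the only center so far

# Deterministic "farthest point" rule: always lands on the outlier.
farthest = int(np.argmax(D2))

# Probabilistic rule: repeat the draw many times and see where it lands.
rng = np.random.default_rng(0)
draws = np.array([sample_index(D2, rng) for _ in range(2000)])
outlier_share = np.mean(draws == farthest)
cluster_share = np.mean((draws >= 1) & (draws <= 50))

print("farthest point index:", farthest)
print("share of draws hitting the outlier:", outlier_share)
print("share of draws hitting the dense cluster:", cluster_share)
```

In this toy setup the outlier has the largest single weight, but the 50 cluster points together hold most of the probability mass, so the draw lands in the cluster far more often (the outlier's weight is about 196 out of roughly 5213, i.e. under 4%). Picking the maximum would choose the outlier every time.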

Upvotes: 1
