Reputation: 651
As in the picture, why not just choose point 2 as the second cluster center? Why generate a random number in [0, 1] instead?
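import numpy as np

def initialize(X, K):  # k-means++ seeding
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    C = np.zeros((K, n))
    # First center: a uniformly random data point (upper bound is exclusive).
    C[0] = X[np.random.randint(0, m)]
    for k in range(1, K):
        # Squared distance from each point to its nearest already-chosen center.
        # Only C[:k] is used, so the still-empty rows of C are ignored.
        D2 = np.array([min(np.inner(c - x, c - x) for c in C[:k]) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        # Pick the first index whose cumulative probability exceeds r.
        i = m - 1  # fallback if rounding leaves cumprobs[-1] slightly below 1
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C[k] = X[i]
    return C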
Why generate r and compare it with p (the cumulative probability, labeled Sum in the picture)?
Upvotes: 0
Views: 243
Reputation: 4953
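Because the behavior of the algorithm is easier to analyze and understand when the selection is probability-driven.
Intuitively, you don't want to choose the farthest point, as it might be an outlier.
You want to choose a point that is probably part of a mass of points that is pretty far away.
For that purpose, selecting according to a probability distribution (PDF) works well. Drawing r uniformly from [0, 1) and taking the first index whose cumulative probability exceeds r is simply how that selection is carried out: point j ends up being picked with probability probs[j].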
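To make the outlier argument concrete, here is a minimal sketch with made-up 1-D data (the numbers, the helper name `sample_index`, and the use of `np.searchsorted` in place of the explicit loop are my assumptions, not part of the question's code). The first center sits at 0, a dense group of 20 points sits at 10, and a single outlier sits at 15. Always taking the farthest point would pick the outlier every time, while sampling proportionally to the squared distance picks it only about 10% of the time.

```python
import numpy as np

def sample_index(weights, rng):
    """Draw one index with probability proportional to weights (inverse-CDF sampling)."""
    cumprobs = np.cumsum(weights) / np.sum(weights)
    r = rng.random()  # uniform in [0, 1)
    # First index whose cumulative probability exceeds r, same rule as `if r < p`.
    # The clamp guards against rounding leaving cumprobs[-1] slightly below 1.
    idx = int(np.searchsorted(cumprobs, r, side="right"))
    return min(idx, len(weights) - 1)

# Hypothetical data: one center already chosen at 0.
first_center = 0.0
points = np.concatenate([np.full(20, 10.0), [15.0]])  # dense group + one outlier

d2 = (points - first_center) ** 2   # squared distance to the chosen center
probs = d2 / d2.sum()

farthest = int(np.argmax(d2))       # deterministic rule: always the outlier
p_outlier = probs[-1]               # 225 / 2225, roughly 0.10
p_group = probs[:-1].sum()          # 2000 / 2225, roughly 0.90

# Empirical check of the sampling rule.
rng = np.random.default_rng(0)
draws = np.array([sample_index(d2, rng) for _ in range(10000)])
empirical_outlier = np.mean(draws == len(points) - 1)

print(farthest, p_outlier, p_group, empirical_outlier)
```

Each individual point in the group is less likely than the outlier, but the group as a whole carries most of the probability mass, which is why the random draw tends to land inside a real cluster.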
Upvotes: 1