PythonExperiments
PythonExperiments

Reputation: 11

K-means Clustering from scratch in Python - redefining the centroids

I am trying to do K-Means clustering from scratch in Python. Here is my code, there is a problem with the way I redefine the centroids

This this the output I get:

Iteration 1:
[1.5, 8.1] [8.04, 1.525]
Iteration 2:
[4.98, 4.05] [2.87, 4.09]
Iteration 3:
[9.29, 8.28] [8.57, 7.87]
Iteration 4:
[9.97, 8.94] [inf, inf]

Thanks in advance!

# Example dataset
data = pd.DataFrame({'x' : [6.480, 7.320, 4.380, 8.040, 7.680, 6.600, 6.420, 5.940, 4.140, 5.700,
                            7.500, 7.620, 6.840, 7.500, 4.920, 3.780, 7.860, 4.260, 7.980, 6.840,
                            3.025, 2.300, 3.250, 2.975, 3.325, 1.500, 1.875, 2.850, 1.600, 2.525,
                            2.900, 2.175, 2.050, 1.650, 2.250, 3.475, 1.800, 2.975, 3.025, 2.175 ],

                     'y' : [6.300, 5.220, 6.060, 4.560, 7.080, 4.740, 3.660, 4.680, 4.800, 5.880,
                            8.100, 7.800, 3.900, 6.780, 4.860, 5.100, 4.380, 5.160, 5.520, 5.700,
                            2.125, 3.475, 2.500, 2.875, 2.075, 3.350, 1.525, 3.050, 2.950, 2.150, 
                            2.125, 2.550, 3.375, 1.950, 1.700, 2.400, 2.525, 2.525, 2.675, 3.325]})

data['Cluster'] = 0
data['EuclideanDist1'] = 0
data['EuclideanDist2'] = 0
data['EuclideanDistD'] = 0   

iterations = 0

C1nx = C1ny =  0
C2nx = C2ny =  0
C1c = 0
C2c = 0

C1 = [min(data['x']), max(data['y'])]
C2 = [max(data['x']), min(data['y'])]

count = 0

while(iterations < 40):
    print(C1, C2)    
    for count in range(0, len(data)-1):

        data['EuclideanDist1'][count] = ((data['x'][count] - C1[0])**2 + (data['y'][count] - C1[1])**2)**(0.5)
        data['EuclideanDist2'][count] = ((data['x'][count] - C2[0])**2 + (data['y'][count] - C2[1])**2)**(0.5)
        data['EuclideanDistD'][count] = data['EuclideanDist1'] [count]- data['EuclideanDist2'][count]  


        if data['EuclideanDistD'][count] >= 0:
            data['Cluster'][count] = 1
            C1nx = C1nx + data['x'][count]
            C1ny = C1ny + data['y'][count]
            C1c = C1c + 1

        elif data['EuclideanDistD'][count] < 0:
            data['Cluster'][count] = 2
            C2nx = C2nx + data['x'][count]
            C2ny = C2ny + data['y'][count]       
            C2c = C2c + 1

    C1[0] = (C1nx / C1c)
    C1[1] = (C1ny / C1c)
    C2[0] = (C2nx / C2c)
    C2[1] = (C2ny / C2c)

    C1n = [0,0]
    C2n = [0,0]
    C1c = 0
    C2c = 0

    iterations = iterations + 1

Upvotes: 1

Views: 559

Answers (1)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77454

  1. K-means does not use Euclidean distance. Remove the sqrt.
  2. Clusters can become empty with bad starting conditions. Then you get a division by 0, and NaN values.
  3. You logic is wrong. You assign every point to the farthest cluster, that is why you always run into above problem. Plus, it won't be easy to increase k > 2.

Avoid stacking lookups [a][b] inside loops. It's not very readable, and slow. Use local variables appropriately. Since Python is fairly slow in interpreter mode, try to use vectorised numpy operations where possible to benefit from their faster C/Fortran code.

Upvotes: 1

Related Questions