Reputation: 11
I am trying to do K-Means clustering from scratch in Python. Here is my code, there is a problem with the way I redefine the centroids
This this the output I get:
Iteration 1:
[1.5, 8.1] [8.04, 1.525]
Iteration 2:
[4.98, 4.05] [2.87, 4.09]
Iteration 3:
[9.29, 8.28] [8.57, 7.87]
Iteration 4:
[9.97, 8.94] [inf, inf]
Thanks in advance!
# Example dataset
data = pd.DataFrame({'x' : [6.480, 7.320, 4.380, 8.040, 7.680, 6.600, 6.420, 5.940, 4.140, 5.700,
7.500, 7.620, 6.840, 7.500, 4.920, 3.780, 7.860, 4.260, 7.980, 6.840,
3.025, 2.300, 3.250, 2.975, 3.325, 1.500, 1.875, 2.850, 1.600, 2.525,
2.900, 2.175, 2.050, 1.650, 2.250, 3.475, 1.800, 2.975, 3.025, 2.175 ],
'y' : [6.300, 5.220, 6.060, 4.560, 7.080, 4.740, 3.660, 4.680, 4.800, 5.880,
8.100, 7.800, 3.900, 6.780, 4.860, 5.100, 4.380, 5.160, 5.520, 5.700,
2.125, 3.475, 2.500, 2.875, 2.075, 3.350, 1.525, 3.050, 2.950, 2.150,
2.125, 2.550, 3.375, 1.950, 1.700, 2.400, 2.525, 2.525, 2.675, 3.325]})
data['Cluster'] = 0
data['EuclideanDist1'] = 0
data['EuclideanDist2'] = 0
data['EuclideanDistD'] = 0
iterations = 0
C1nx = C1ny = 0
C2nx = C2ny = 0
C1c = 0
C2c = 0
C1 = [min(data['x']), max(data['y'])]
C2 = [max(data['x']), min(data['y'])]
count = 0
while(iterations < 40):
print(C1, C2)
for count in range(0, len(data)-1):
data['EuclideanDist1'][count] = ((data['x'][count] - C1[0])**2 + (data['y'][count] - C1[1])**2)**(0.5)
data['EuclideanDist2'][count] = ((data['x'][count] - C2[0])**2 + (data['y'][count] - C2[1])**2)**(0.5)
data['EuclideanDistD'][count] = data['EuclideanDist1'] [count]- data['EuclideanDist2'][count]
if data['EuclideanDistD'][count] >= 0:
data['Cluster'][count] = 1
C1nx = C1nx + data['x'][count]
C1ny = C1ny + data['y'][count]
C1c = C1c + 1
elif data['EuclideanDistD'][count] < 0:
data['Cluster'][count] = 2
C2nx = C2nx + data['x'][count]
C2ny = C2ny + data['y'][count]
C2c = C2c + 1
C1[0] = (C1nx / C1c)
C1[1] = (C1ny / C1c)
C2[0] = (C2nx / C2c)
C2[1] = (C2ny / C2c)
C1n = [0,0]
C2n = [0,0]
C1c = 0
C2c = 0
iterations = iterations + 1
Upvotes: 1
Views: 559
Reputation: 77454
Avoid stacking lookups [a][b]
inside loops. It's not very readable, and slow. Use local variables appropriately. Since Python is fairly slow in interpreter mode, try to use vectorised numpy operations where possible to benefit from their faster C/Fortran code.
Upvotes: 1