Reputation: 11
I built a K-means clustering program; however, each time I run the program the plot changes. I don't know why this is happening, and any help would be much appreciated.
import numpy as np
import matplotlib.pyplot as plt
import random


def cal_centroids(clusters, cluster_array, k):
    """Return the mean (x, y) of the points assigned to each of the k clusters."""
    new_centroids = []
    for c in range(k):
        x = 0
        y = 0
        count = 0
        for i in range(len(clusters)):
            if clusters[i] == c:
                x += cluster_array[i][0]
                y += cluster_array[i][1]
                count += 1
        x /= count
        y /= count
        new_centroids.append([x, y])
    return new_centroids


def assign_clusters(centroids, cluster_array):
    """Label each point with the index of its nearest centroid."""
    clusters = []
    for i in range(cluster_array.shape[0]):
        distances = []
        for centroid in centroids:
            distances.append(calc_distance(centroid, cluster_array[i]))
        # index of the first (smallest-index) minimum distance
        clusters.append(distances.index(min(distances)))
    return clusters


def calc_distance(x1, x2):
    """Euclidean distance between two points."""
    return (sum((x1 - x2) ** 2)) ** 0.5
# From here on it is mostly storing data, initializing centroids and assigning cluster labels to the data.
def kmean(data, no_clusters, iterations):
    # pick no_clusters distinct data points at random as the initial centroids
    s = random.sample(range(data.shape[0]), no_clusters)
    centroids = []
    for i in s:
        centroids.append(data[i, :])
    clusters = assign_clusters(centroids, data)
    initial_centroids = [i for i in centroids]
    for _ in range(iterations):
        centroids = cal_centroids(clusters, data, no_clusters)
        # store the new labels in clusters so the next iteration uses them
        clusters = assign_clusters(centroids, data)
    # group the data points by their final cluster label
    dict_centroids = {}
    for i in range(no_clusters):
        dict_centroids[i] = []
    for i in range(no_clusters):
        for j in range(data.shape[0]):
            if clusters[j] == i:
                dict_centroids[i].append(data[j, :])
    return dict_centroids, centroids, clusters
def extract_file(file_name):
    """Read comma-separated integer pairs, one per line, into an array."""
    with open(file_name, 'r') as file:
        lines = [list(map(int, line.strip("\n").split(","))) for line in file]
    return np.array(lines)


data = extract_file("backyard.txt")
dict_centroids, centroids, clusters = kmean(data, 2, 8)
x = data[:, 0]
y = data[:, 1]
fig = plt.figure()
scatter = plt.scatter(x, y, c=clusters, s=40)
# mark the final centroids with red crosses
for i, j in centroids:
    plt.scatter(i, j, s=50, c='red', marker='+')
plt.xlabel("Vitamin C")
plt.ylabel("GLA")
plt.title("File backyard 2 groups Displayed")
plt.show()
The contents of backyard.txt are:
40,40
10,10
200,200
230,231
40,43
15,45
220,190
Upvotes: 0
Views: 101
Reputation: 21
I haven't run your code, but if the plot changes on every run, that is expected. K-means starts from a random initialization, which your code does with this line: s= random.sample(range(data.shape[0]),no_clusters). K-means is not guaranteed to converge to the global minimum; it converges to a local minimum that depends on the random start.
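To illustrate the dependence on the start, here is a minimal sketch that runs k-means on the data from the question from two fixed starting pairs instead of a random one. The helper kmeans_from and its vectorized NumPy distance computation are assumptions made for illustration, not the code from the question.

```python
import numpy as np

# Data copied from the backyard.txt listing in the question
data = np.array([[40, 40], [10, 10], [200, 200], [230, 231],
                 [40, 43], [15, 45], [220, 190]], dtype=float)


def kmeans_from(data, init_idx, iterations=8):
    """Plain k-means started from the data points at init_idx (hypothetical helper)."""
    centroids = data[list(init_idx)].copy()
    for _ in range(iterations):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its points; leave it alone if it has none
        for c in range(len(centroids)):
            if np.any(labels == c):
                centroids[c] = data[labels == c].mean(axis=0)
    # final labels and within-cluster sum of squared distances
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists.min(axis=1) ** 2).sum())
    return labels, inertia


labels_a, inertia_a = kmeans_from(data, (0, 2))  # start from (40,40) and (200,200)
labels_b, inertia_b = kmeans_from(data, (2, 0))  # same two points, opposite order
print(labels_a, inertia_a)
print(labels_b, inertia_b)
```

For these two starts the grouping and the within-cluster sum of squares are identical, but the labels 0 and 1 are swapped, which is enough to change the colours in a scatter plot drawn with c=clusters.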
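To get the same plot on every run, fix the random start by setting a seed. Your code draws the starting points with Python's random module, so call random.seed(42) before kmean; numpy.random.seed(42) only affects NumPy's own generator and would matter only if you sampled with numpy.random.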
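The seed has to be set on the generator that actually draws the starting points. The question's code uses random.sample from Python's standard library, whose state is separate from NumPy's. A minimal sketch, assuming the 7 points and 2 clusters from the question; the variable names and the seed value 42 are only illustrative:

```python
import random
import numpy as np

n_points, k = 7, 2  # sizes taken from the question's data and its kmean(data, 2, 8) call

# Seeding the standard-library generator makes random.sample repeatable
random.seed(42)
first = random.sample(range(n_points), k)
random.seed(42)
second = random.sample(range(n_points), k)
print(first == second)  # True: same seed, same starting indices

# If the sampling is done with NumPy instead, seed a NumPy generator
rng = np.random.default_rng(42)
idx = rng.choice(n_points, size=k, replace=False)
print(idx)
```

With the seed fixed, every run starts from the same centroids and therefore produces the same labels and the same plot. Different seed values may still give different results.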
Upvotes: 1