Brandon Oson

Reputation: 11

Why is the graph changing on every run?

Hi there. I built a k-means clustering program; however, each time I run it the plot changes. I don't know why it's happening, and if anyone could help it would be much appreciated.

import numpy as np
import matplotlib.pyplot as plt
import random
def cal_centroids(clusters, cluster_array,k):
    new_centroids= []
    for c in range(k):
        x= 0
        y=0
        count=0
        for i in range(len(clusters)):
            if clusters[i]==c:
                x+=cluster_array[i][0]
                y+=cluster_array[i][1]
                count+=1
        x/=count
        y/=count
        new_centroids.append([x,y])
    return new_centroids
def assign_clusters(centroids,cluster_array):
    clusters=[]
    for i in range(cluster_array.shape[0]):
        distances=[]
        for centroid in centroids:
            distances.append(calc_distance(centroid,cluster_array[i]))
        # index of the nearest centroid (first one wins on ties)
        clusters.append(distances.index(min(distances)))
    return clusters
def calc_distance(x1,x2):
    return (sum((x1-x2)**2))**0.5

# From here on it's mostly storing data, initializing centroids, and assigning cluster labels to the data

def kmean(data,no_clusters,iterations): 
    s= random.sample(range(data.shape[0]),no_clusters)
    centroids= []
    for i in s:
        centroids.append(data[i,:])
    clusters= assign_clusters(centroids,data)
    initial_centroids= [i for i in centroids]
    for i in range(0,iterations):
        centroids= cal_centroids(clusters,data,no_clusters)
        # reassign every point to its nearest updated centroid
        clusters= assign_clusters(centroids,data)
    dict_centroids= {}
    for i in range(no_clusters):
        dict_centroids[i]=[]
    for i in range(no_clusters):
        for j in range(data.shape[0]):
            if(clusters[j]==i):
                dict_centroids[i].append(data[j,:])
    return dict_centroids,centroids,clusters

def extract_file(file_name):
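    # read one "x,y" pair per line, skipping blank lines; the file is closed automatically
    with open(file_name,'r') as file:
        lines = [list(map(int, line.strip("\n").split(","))) for line in file if line.strip()]
    return np.array(lines)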
data= extract_file("backyard.txt")
dict_centroids,centroids,clusters= kmean(data,2,8)
x= data[:,0]
y= data[:,1]
fig=plt.figure()
scatter= plt.scatter(x,y,c=clusters,s=40)
for i,j in centroids:
    plt.scatter(i,j,s=50,c='red',marker= '+')
plt.xlabel("Vitamin C")
plt.ylabel("GLA")
plt.title("File backyard 2 groups Displayed")
plt.show()

The contents of backyard.txt are:

40,40
10,10
200,200
230,231
40,43
15,45
220,190

Upvotes: 0

Views: 101

Answers (1)

lafak

Reputation: 21

I haven't run your code, but a graph that changes on every run is expected here. K-means starts from a random initialization (in your code, this line: s= random.sample(range(data.shape[0]),no_clusters)). There is no guarantee that k-means will converge to the global minimum; it converges to a local minimum that depends on the random start. Even when the grouping comes out the same, the cluster labels (and so the plot colors) can swap between runs. To get the same plot every time, fix the random start by seeding the generator you actually sample from. Since you use Python's random module, call random.seed(42) before kmean; numpy.random.seed(42) would only affect NumPy's own generator and would not change what random.sample returns.
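Here is a minimal sketch of fixing the random start. It assumes the starting indices are drawn with Python's built-in random module, as in the line quoted above, so the seed has to go to that module rather than to NumPy. The helper name pick_initial_indices and the seed value 42 are only illustrative; they are not part of your code.

```python
import random

def pick_initial_indices(n_points, k, seed=None):
    """Pick k distinct row indices to use as starting centroids.

    Hypothetical helper: with seed=None the start is random on every run;
    with a fixed seed the same indices come back every time.
    """
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    return rng.sample(range(n_points), k)

# 7 points and 2 clusters, matching the backyard data in the question
first = pick_initial_indices(7, 2, seed=42)
second = pick_initial_indices(7, 2, seed=42)
print(first == second)  # True: same seed, same starting points
```

Calling random.seed(42) once at the top of the script should have the same effect on random.sample; a private random.Random instance just keeps the seeding local to the k-means start.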

Upvotes: 1
