Reputation: 98
I am trying to implement a k-means clustering algorithm from scratch in Python, and I am having problems updating the centroid values for each cluster. The code below shows where I am up to so far. I have already assigned each data point to one of k clusters. AllData contains 329 rows; each row is a word, followed by 300 features, followed by the number of the cluster it has been assigned to (values 1 to 4). In my loop I start by creating an array A that holds only the rows of AllData assigned to the first cluster. I then take the mean of each feature column in A and update that cluster's centroid to this value. The loop should repeat this for all 4 clusters.
k = 4
i = 1
while (i <= k):
    A = AllData[:,1:301][AllData[:,301] == i]
    centroids[i-1:i,:] = A.mean(axis=0)
    i = i + 1
The values of the 4 rows in the centroids array are updating correctly. The problem is that the 4 updated centroid values are also overwriting the first 4 rows of AllData. I don't want this to happen; the AllData array should remain unchanged. Any help would be much appreciated!
Upvotes: 2
Views: 1806
Reputation: 77454
In Python, as in the majority of programming languages, array indices begin at 0. So the slice 1:301 skips the first column (index 0), and index 301 is out of bounds unless each row has at least 302 entries.
You can use array[array[:,-1]==i,:-1]
although I would recommend keeping the input data and the labels separate.
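A minimal sketch of what keeping them separate could look like. The names features, labels and original, the small random data set, and the way centroids is initialised are all assumptions for illustration; the question does not show how centroids was created. If it was taken as a plain slice of AllData, NumPy would return a view that shares memory with AllData, which would explain the overwritten rows, and .copy() avoids that.

```python
import numpy as np

# Hypothetical stand-in data: 12 points, 3 features, k = 2 clusters.
# (The question has 329 rows, 300 features and k = 4.)
rng = np.random.default_rng(0)
features = rng.normal(size=(12, 3))   # numeric features only, no word column
labels = np.array([1, 2] * 6)         # cluster numbers 1..k, kept in their own array
k = 2

# Assumption: the initial centroids are the first k data points.
# .copy() matters here: features[:k] alone is a view into features,
# so writing to it would also change the data.
centroids = features[:k].copy()

original = features.copy()            # kept only to verify nothing is overwritten

for i in range(1, k + 1):
    members = features[labels == i]          # boolean mask selects rows of cluster i
    centroids[i - 1] = members.mean(axis=0)  # column-wise mean becomes the new centroid

# The input data is unchanged after the update.
assert np.array_equal(features, original)
```

With the labels in their own array there are no column offsets to get wrong, and the feature array can stay a plain float array.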
Upvotes: 1