Reputation: 61
I have k-means clustered data and the cluster centroids from the k-means run. I want to calculate the density of each cluster centroid and remove the cluster whose centroid has the highest density. I did some research and found this formula.
N(c) is the set of neighboring cluster centroids of cluster c, and its size should be 5. I tried to implement the algorithm but couldn't. Can you help me implement it?
Here is my code so far:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=5000, n_features=15, centers=15, cluster_std=1, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
TrainData = X_train, y_train

n_clusters_sampling = 10
kmeans2 = KMeans(n_clusters=n_clusters_sampling, random_state=10)
kmeans2.fit(X_train)
centroids = kmeans2.cluster_centers_
Upvotes: 2
Views: 1076
Reputation: 1772
Your problem is essentially a "k-nearest neighbors search" on the "new dataset" formed by the centroids: you want the 5 closest centroids for each centroid, together with their distances. Fortunately, sklearn provides exactly this functionality in the NearestNeighbors class:
...
centroids = kmeans2.cluster_centers_

import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=6)  # 6 is not a typo; explanation below.
nn.fit(centroids)
neigh_dist, neigh_ind = nn.kneighbors(centroids, return_distance=True)

densities = [5 / np.sum(neigh_dist[i, :]) for i in range(centroids.shape[0])]
print(densities)
Note that we are fitting the nn object on the same datapoints (the centroids) we are performing the queries on. This is why n_neighbors is 6: for each centroid, the centroid itself is its closest neighbor, at distance zero.

The .kneighbors() method, when return_distance is set to True, also returns an array of distances of shape (n, n_neighbors), where n is the number of query points, i.e. the centroids. Cell (i, j) of that array is the distance of neighbor j from centroid i. Since the zero self-distance contributes nothing to a row's sum, dividing |N(c)| = 5 by each row sum gives the densities, following the formula you posted.
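To see the self-neighbor behavior concretely, here is a minimal sketch on made-up one-dimensional "centroids" (hypothetical data, not from the question):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three toy "centroids" on a line.
pts = np.array([[0.0], [1.0], [3.0]])

nn = NearestNeighbors(n_neighbors=2)
nn.fit(pts)
dist, ind = nn.kneighbors(pts, return_distance=True)

# Column 0: each point is its own nearest neighbor, at distance zero.
print(ind[:, 0])    # [0 1 2]
print(dist[:, 0])   # [0. 0. 0.]
# Column 1: the nearest *other* point and its distance.
print(dist[:, 1])   # [1. 1. 2.]
```

This is exactly why the answer queries with one extra neighbor: discarding (or summing over) the zero-distance self-match leaves the 5 genuine neighbors.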
Edit: The next part of the answer addresses the OP's comment for removing the cluster of highest density.
Removing a cluster, say c, essentially means reassigning the cluster labels of its datapoints to the next closest centroid. So we now have a new 1-nearest-neighbor problem, and we can reuse the NearestNeighbors object we created. We perform a 2-nearest-neighbor search on the "centroids dataset" for the points that were initially assigned to c. The first neighbor will of course be c itself, so we keep only the second nearest neighbor. Then we simply update the original assignment table for those datapoints with the new indices.
# run k-means and get an array of initial cluster assignments.
assignments = kmeans2.predict(X_train)
# find the index of cluster to be removed
c = np.argmax(densities)
# for each point originally assigned to c, find its closest centroid.
# Again we are using the trick of searching for one more neighbor since we know
# the closest centroid of those points will be c.
nearest_centroids = nn.kneighbors(X_train[assignments==c,:], n_neighbors=2, return_distance=False)
# get the new closest centroid (that is, the second column of the array) and make it 1-D
nearest_centroids = nearest_centroids[:,1].flatten()
# simply update the initial assignment table for the specific datapoints
assignments[assignments==c] = nearest_centroids
The assignments array now does not contain the value c. Note that this may leave a "hole" when plotting or doing other post-processing of the results, as there will be a cluster with no points assigned to it. If you want to avoid this, simply subtract one from the indices higher than c:
assignments = np.array([i-1 if i>c else i for i in assignments])
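Equivalently, since assignments is a NumPy integer array, the shift can be done in place with a vectorized operation; a minimal sketch on made-up labels:

```python
import numpy as np

# Hypothetical example: labels 0..3 with cluster c = 2 already emptied.
c = 2
assignments = np.array([0, 1, 3, 3, 0, 1])

# Shift every label above c down by one so the labels stay contiguous.
assignments[assignments > c] -= 1
print(assignments)  # [0 1 2 2 0 1]
```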
and if you want to remove the centroid as well:
centroids = np.delete(centroids, c, axis=0) # remove row from numpy array by index
Upvotes: 1
Reputation: 39
You could do it with the Manhattan distance; the formula is d(p, q) = d(q, p) = Sum(|qi - pi|).
def ManhattanDistance(x, y):
    s = 0
    for i in range(len(x)):
        s += abs(x[i] - y[i])
    return s
and here you can find the centroid at the shortest distance:
def Classify(centroieds, item):
    minimum = float('inf')
    index = -1
    for i in range(len(centroieds)):
        # Find the distance from the item to this centroid
        dis = ManhattanDistance(item, centroieds[i])
        if dis < minimum:
            minimum = dis
            index = i
    return index
and here you can find the cluster that each item fits in; I put all the clusters in a dict:
def FindClusters(centroieds, items):
    clusters = {}  # Init clusters
    for item in items:
        # Classify the item into a cluster
        index = Classify(centroieds, item)
        # Add the item to that cluster
        if index in clusters:
            clusters[index].append(item)
        else:
            clusters[index] = [item]
    return clusters
That is not all of my code, just parts of it. I hope it helps.
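As a minimal self-contained sketch of the same idea, with compact re-implementations of the helpers above and made-up 2-D points (none of this data is from the question):

```python
# Compact versions of the Manhattan-distance helpers sketched above.
def ManhattanDistance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def Classify(centroieds, item):
    # Index of the centroid closest to item under Manhattan distance.
    return min(range(len(centroieds)), key=lambda i: ManhattanDistance(item, centroieds[i]))

# Hypothetical toy data to exercise the functions.
centroids = [[0, 0], [10, 10]]
items = [[1, 1], [9, 8], [0, 2], [11, 12]]

clusters = {}
for item in items:
    clusters.setdefault(Classify(centroids, item), []).append(item)

print(clusters)  # {0: [[1, 1], [0, 2]], 1: [[9, 8], [11, 12]]}
```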
Upvotes: 0