Reputation: 61
I have k-means clustered data and the cluster centroids from the k-means run. I want to calculate the density of each cluster centroid and remove the cluster whose centroid has the highest density. I did some research and found this formula.
N(c) is the set of neighboring cluster centroids of cluster c, and its size should be 5. I tried to implement the algorithm but couldn't. Can you help me implement it?
Here is my code so far:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=5000, n_features=15, centers=15, cluster_std=1, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
TrainData = X_train, y_train

n_clusters_sampling = 10
kmeans2 = KMeans(n_clusters=n_clusters_sampling, random_state=10)
kmeans2.fit(X_train)
centroids = kmeans2.cluster_centers_
Upvotes: 2
Views: 1076
Reputation: 1772
Your problem is essentially a "k-nearest neighbors search" on the "new dataset" formed by the centroids: you want the 5 closest centroids for each centroid, together with their distances. Fortunately, sklearn provides exactly this functionality in the NearestNeighbors class:
...
centroids = kmeans2.cluster_centers_

import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=6)  # 6 is not a typo; explanation below.
nn.fit(centroids)
neigh_dist, neigh_ind = nn.kneighbors(centroids, return_distance=True)

densities = [5 / np.sum(neigh_dist[i, :]) for i in range(centroids.shape[0])]
print(densities)
Note that we are fitting the nn object on the same datapoints (the centroids) we are performing the queries on. This is why n_neighbors is 6: for each centroid, the centroid itself is its closest neighbor, at distance zero.

The .kneighbors() method, when return_distance is set to True, also returns an array of distances of shape (n, n_neighbors), where n is the number of query points, i.e. the centroids. Cell (i, j) of that array is the distance of neighbor j from centroid i. Since the zero self-distance contributes nothing to a row's sum, dividing |N(c)| = 5 by each row sum gives the densities, following the formula you posted.
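To see the self-neighbor behavior concretely, here is a minimal sketch on made-up one-dimensional "centroids" (hypothetical data, not from the question):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three toy "centroids" on a line.
pts = np.array([[0.0], [1.0], [3.0]])

nn = NearestNeighbors(n_neighbors=2)
nn.fit(pts)
dist, ind = nn.kneighbors(pts, return_distance=True)

# Column 0: each point is its own nearest neighbor, at distance zero.
print(ind[:, 0])    # [0 1 2]
print(dist[:, 0])   # [0. 0. 0.]
# Column 1: the nearest *other* point and its distance.
print(dist[:, 1])   # [1. 1. 2.]
```

This is exactly why the answer queries with one extra neighbor: discarding (or summing over) the zero-distance self-match leaves the 5 genuine neighbors.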
Edit: The next part of the answer addresses the OP's comment for removing the cluster of highest density.
Removing a cluster, say c, essentially means reassigning the cluster labels of its datapoints to the next closest centroid. So we now have a new 1-nearest-neighbor problem, and we can reuse the NearestNeighbors object we created. We perform a 2-nearest-neighbor search on the "centroids dataset" for the points that were initially assigned to c. The first neighbor will of course be c itself, so we keep only the second nearest neighbor. Then we simply update the original assignment table for those datapoints with the new indices.
# run k-means and get an array of initial cluster assignments.
assignments = kmeans2.predict(X_train)
# find the index of cluster to be removed
c = np.argmax(densities)
# for each point originally assigned to c, find its closest centroid.
# Again we are using the trick of searching for one more neighbor since we know
# the closest centroid of those points will be c.
nearest_centroids = nn.kneighbors(X_train[assignments==c,:], n_neighbors=2, return_distance=False)
# get the new closest centroid (that is, the second column of the array) and make it 1-D
nearest_centroids = nearest_centroids[:,1].flatten()
# simply update the initial assignment table for the specific datapoints
assignments[assignments==c] = nearest_centroids
The assignments array now does not contain the value c. Note that this may leave a "hole" when plotting or doing other post-processing of the results, as there will be a cluster with no points assigned to it. If you want to avoid this, simply subtract one from the indices higher than c:
assignments = np.array([i-1 if i>c else i for i in assignments])
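Equivalently, since assignments is a NumPy integer array, the shift can be done in place with a vectorized operation; a minimal sketch on made-up labels:

```python
import numpy as np

# Hypothetical example: labels 0..3 with cluster c = 2 already emptied.
c = 2
assignments = np.array([0, 1, 3, 3, 0, 1])

# Shift every label above c down by one so the labels stay contiguous.
assignments[assignments > c] -= 1
print(assignments)  # [0 1 2 2 0 1]
```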
and if you want to remove the centroid as well:
centroids = np.delete(centroids, c, axis=0) # remove row from numpy array by index
Upvotes: 1
Reputation: 39
You could do it with the Manhattan distance; the formula is d(p, q) = d(q, p) = Sum(|qi - pi|).
def ManhattanDistance(x, y):
    s = 0
    for i in range(len(x)):
        s += abs(x[i] - y[i])
    return s
and here you can find the centroid at the shortest distance:
def Classify(centroieds, item):
    minimum = float('inf')
    index = -1
    for i in range(len(centroieds)):
        # Find the distance from the item to this centroid
        dis = ManhattanDistance(item, centroieds[i])
        if dis < minimum:
            minimum = dis
            index = i
    return index
and here you can find the cluster that each item fits in; I put all the clusters in a dict:
def FindClusters(centroieds, items):
    clusters = {}  # Init clusters
    for item in items:
        # Classify the item into a cluster
        index = Classify(centroieds, item)
        # Add the item to that cluster
        if index in clusters:
            clusters[index].append(item)
        else:
            clusters[index] = [item]
    return clusters
That is not all of my code, just parts of it. I hope it helps.
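As a minimal self-contained sketch of the same idea, with compact re-implementations of the helpers above and made-up 2-D points (none of this data is from the question):

```python
# Compact versions of the Manhattan-distance helpers sketched above.
def ManhattanDistance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def Classify(centroieds, item):
    # Index of the centroid closest to item under Manhattan distance.
    return min(range(len(centroieds)), key=lambda i: ManhattanDistance(item, centroieds[i]))

# Hypothetical toy data to exercise the functions.
centroids = [[0, 0], [10, 10]]
items = [[1, 1], [9, 8], [0, 2], [11, 12]]

clusters = {}
for item in items:
    clusters.setdefault(Classify(centroids, item), []).append(item)

print(clusters)  # {0: [[1, 1], [0, 2]], 1: [[9, 8], [11, 12]]}
```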
Upvotes: 0