Reputation: 1363
I'm trying to calculate the Davies-Bouldin Index in Python.
Here are the steps the code below tries to reproduce.
5 Steps:
- For each cluster, compute the euclidean distance between each point and the centroid
- For each cluster, compute the mean of these distances
- For each pair of clusters, compute the euclidean distance between their centroids

Then,
- For each pair of clusters, sum their two intra-cluster distance means and divide that sum by the distance between their centroids

Finally,
- Compute the mean of all these ratios to get the Davies-Bouldin index of the whole clustering
Code
def daviesbouldin(X, labels, centroids):
    import numpy as np
    from scipy.spatial.distance import pdist, euclidean
    nbre_of_clusters = len(centroids)  # Get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)]  # Store intra-cluster distances by cluster
    distances_means = []  # Store the mean of these distances
    DB_indexes = []  # Store Davies-Bouldin index of each pair of clusters
    second_cluster_idx = []  # Store index of the second cluster of each pair
    first_cluster_idx = 0  # Set index of first cluster of each pair to 0

    # Step 1: Compute euclidean distances between each point of a cluster and its centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))

    # Step 2: Compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))

    # Step 3: Compute euclidean distances between each pair of centroids
    ctrds_distance = pdist(centroids)

    # Tricky step 4: Compute Davies-Bouldin index of each pair of clusters
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i - 1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])

    # Step 5: Compute the mean of all DB_indexes
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))
In arguments:
- X is the data
- labels are the labels computed by a clustering algorithm (e.g. kmeans)
- centroids are the coordinates of each cluster's centroid (e.g. cluster_centers_)

Also, note that I'm using Python 3.
QUESTION1: Is the computation of euclidean distances between each pair of centroids correct (step 3)?
QUESTION2: Is my implementation of step 4 correct?
QUESTION3: Do I need to normalise intra- and inter-cluster distances?
Further explanations on Step 4
Let's say we have 10 clusters. The loop should compute the DB index of each pair of clusters.

At the first iteration:
- it sums the intra-distances mean of cluster 1 (index 0 of distances_means) and the intra-distances mean of cluster 2 (index 1 of distances_means)
- it divides this sum by the distance between these two centroids (index 0 of ctrds_distance)

At the second iteration:
- it sums the intra-distances mean of cluster 1 (index 0 of distances_means) and the intra-distances mean of cluster 3 (index 2 of distances_means)
- it divides this sum by the distance between these two centroids (index 1 of ctrds_distance)

and so on...
With the example of 10 clusters, the full iteration process should look like this:
intra-cluster distance    intra-cluster distance    distance between their
of cluster:               of cluster:               centroids (storage index):
0 + 1 / 0
0 + 2 / 1
0 + 3 / 2
0 + 4 / 3
0 + 5 / 4
0 + 6 / 5
0 + 7 / 6
0 + 8 / 7
0 + 9 / 8
1 + 2 / 9
1 + 3 / 10
1 + 4 / 11
1 + 5 / 12
1 + 6 / 13
1 + 7 / 14
1 + 8 / 15
1 + 9 / 16
2 + 3 / 17
2 + 4 / 18
2 + 5 / 19
2 + 6 / 20
2 + 7 / 21
2 + 8 / 22
2 + 9 / 23
3 + 4 / 24
3 + 5 / 25
3 + 6 / 26
3 + 7 / 27
3 + 8 / 28
3 + 9 / 29
4 + 5 / 30
4 + 6 / 31
4 + 7 / 32
4 + 8 / 33
4 + 9 / 34
5 + 6 / 35
5 + 7 / 36
5 + 8 / 37
5 + 9 / 38
6 + 7 / 39
6 + 8 / 40
6 + 9 / 41
7 + 8 / 42
7 + 9 / 43
8 + 9 / 44
The problem here is that I'm not quite sure the index of distances_means matches the index of ctrds_distance.
In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2, that the second corresponds to the distance between cluster 1 and cluster 3, and so on, following the pattern above.
In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.
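One way to check this pairing assumption is to enumerate the pairs explicitly and compare them with the condensed output of pdist; here is a minimal self-check sketch (dummy random centroids, purely for illustration):

# Minimal self-check (illustration only): enumerate cluster pairs in the same
# order as the table above and compare each flat index against pdist's output.
from itertools import combinations
import numpy as np
from scipy.spatial.distance import pdist, euclidean

nbre_of_clusters = 10
rng = np.random.default_rng(0)
centroids = rng.random((nbre_of_clusters, 2))  # dummy centroids

ctrds_distance = pdist(centroids)
for storage_num, (i, j) in enumerate(combinations(range(nbre_of_clusters), 2)):
    # pdist stores the distance between centroids i and j at this flat index
    assert np.isclose(ctrds_distance[storage_num], euclidean(centroids[i], centroids[j]))
    print(i, "+", j, "/", storage_num)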
Upvotes: 5
Views: 10286
Reputation: 11
The DB index is implemented in scikit-learn starting from version 0.20 (sklearn.metrics.davies_bouldin_score). I checked the results of the built-in function against the code written by me and got the same result:
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    """
    :param X: array of data
    :param labels: cluster label of each sample
    :return: Davies-Bouldin index (mean of the per-cluster maximum similarity ratios)
    """
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []
    for i in range(n_cluster):
        dbij = []
        for j in range(n_cluster):
            if j != i:
                dbij.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
        db.append(max(dbij))
    dbi = np.mean(db)
    return dbi
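For example, a quick comparison against the scikit-learn built-in (toy data made up for the example; assumes scikit-learn >= 0.20) could look like this:

# Toy comparison (illustrative data): the built-in davies_bouldin_score and the
# function above should return the same value for the same clustering.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

print(davies_bouldin_score(X, labels))  # scikit-learn built-in
print(DaviesBouldin(X, labels))         # function above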
Upvotes: 1
Reputation: 26
This one is 20x faster than the code below; all computation is done in numpy.
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(y, axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n, 1))
    dbi_arr = np.empty((n, n))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i, k in enumerate(uniqlbls):
        Xk = X[np.where(y == k)[0], ...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i, ...] = Ak
        sigma_arr[i, ...] = np.mean(cdist(Xk, Ak.reshape(1, -1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u, v: u + v))
    # divide
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi
Here's an implementation using numpy 1.14, scipy 1.1.0, and Python 3. It is not much of a computational speed improvement, but it should have a slightly smaller memory footprint.
import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
        Vector for each example.
    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
        True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(y, axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i, k in enumerate(uniqlbls):
        Xk = X[np.where(y == k)[0], ...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i, ...] = Ak
        sigma_arr[i, ...] = np.mean(cdist(Xk, Ak.reshape(1, -1)))
    # loop over non-duplicate cluster pairs
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j],
                                euclidean(centroid_arr[i, ...], centroid_arr[j, ...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi / n
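A minimal usage sketch for either db_index version (toy data made up for the example):

# Toy usage example: cluster some blobs and score the partition.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

print(db_index(X, labels))  # lower values indicate tighter, better separated clusters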
Upvotes: 0
Reputation: 11
Thanks for the code and revision - it really helped me get started. The shorter, faster version is not entirely correct, so I amended it to correctly average, for each cluster, the dispersion score of its most similar cluster.
See https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure for the original algorithm and explanation:
The DBI is the average of the similarity measures of each cluster with its most similar cluster.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []
    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
        # select Ri value of most similar cluster
        Ri.append(max(Rij))
    # get mean of all Ri values
    dbi = np.mean(Ri)
    return dbi
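As a sanity check, this formulation should agree with scikit-learn's built-in davies_bouldin_score (available from version 0.20); a quick sketch with made-up toy data:

# Sanity check (toy data): the amended function and the scikit-learn
# implementation follow the same definition and should agree.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=400, centers=5, random_state=7)
labels = KMeans(n_clusters=5, random_state=7, n_init=10).fit_predict(X)

print(DaviesBouldin(X, labels))
print(davies_bouldin_score(X, labels))  # should print (nearly) the same value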
Upvotes: 1
Reputation: 1363
Here is a shorter, faster, corrected version of the naive Davies-Bouldin index implementation above.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []
    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
    return np.max(db) / n_cluster
Answering my own questions:
Note that you can find innovative approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces the Euclidean distance with a cylindrical distance.
Upvotes: 2
Reputation: 95
Thank you for your implementation. I just have one question: isn't there a division missing in the last line? In the last step, the value of max(db) should be divided by the number of clusters used.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(Daten, DatenLabels):
    n_cluster = len(np.bincount(DatenLabels))
    cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]  # mean distance to the respective cluster centroid
    db = []
    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
    return np.max(db)
Maybe I'm overlooking that division because I'm new to Python. But in my plots (I'm iterating over a range of cluster counts) the value of max(db) is very low at the beginning and increases afterwards. After scaling by the number of clusters, the graph looks better (a high max(db) value at the beginning that falls steadily as the number of clusters increases).
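For reference, dividing each ratio inside the loop (as in the snippet above) and dividing the maximum once at the end (as in the previous answer) give the same number; a quick check with made-up ratio values:

# Quick check with made-up ratio values: division by a positive constant is
# monotone, so max(ratios / n) equals max(ratios) / n.
import numpy as np

n_cluster = 5
ratios = np.array([0.8, 1.3, 2.1, 0.4, 1.7])  # hypothetical (variances[i] + variances[j]) / distance values

divide_inside = np.max(ratios / n_cluster)  # division inside the loop, as in the snippet above
divide_after = np.max(ratios) / n_cluster   # single division after the max, as in the previous answer
print(divide_inside, divide_after)          # both print 0.42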
Best regards
Upvotes: 1