Reputation: 37
I would like to implement the simple hierarchical agglomerative clustering according to the pseudocode:
I got stuck at the last part where I need to update the distance matrix. So far I have:
import numpy as np
X = np.array([[1, 2],
[0, 3],
[2, 3],])
# Clusters
C = np.zeros((X.shape[0], X.shape[0]))
# Keeps track of active clusters
I = np.zeros(X.shape[0])
# For all n datapoints
for n in range(X.shape[0]):
for i in range(X.shape[0]):
# Compute the similarity of all N x N pairs of images
C[n][i] = np.linalg.norm(X[n] - X[i])
I[n] = 1
# Collects clustering as a sequence of merges
A = []
In each of N iterations
for k in range(X.shape[0] - 1):
# TODO: Find the indices of the smallest distance
# Updated distance matrix
I would like to implement the single-linkage clustering, so I would like to find the argmin of the distance matrix. I originally thought about doing something like:
i, m = np.where(C == np.min(C[np.nonzero(C)]))
i, m = i[0], m[0]
A.append((i, m))
to find the argmin, but I think it is incorrect as it doesn't specify a condition on the active clusters in I. I am also confused because I should just be looking at the upper or lower triangle of the matrix, so if I use the above method I could get the same argmin twice due to symmetry.
I was also thinking about first creating the rows and columns of the new merged cluster:
C = np.vstack((C, np.zeros((1, C.shape[1]))))
C = np.hstack((C, np.zeros((C.shape[0], 1))))
Then somehow update it like:
for j in range(X.shape[0]):
C[i][j] = min(C[i][j], C[m][j])
C[j][i] = min(C[i][j], C[m][j])
I am not sure if this is right approach. Is there a simpler way to find the argmin, merge the rows and columns and update the values?
Upvotes: 1
Views: 1771
Reputation: 399
If you get confused when how to find row and column indexes of minimum dist error,
Firstly,
To avoid getting argmin twice due to symmetry you can construct your initial distance matrix in shape of lower triangle matrix.
def euclidean_distance(p1,p2):
return math.sqrt((p1[0]-p2[0])**2+(p1[1]-p2[1])**2)
distance_matrix = np.zeros((len(X.shape[0]),len(X.shape[0])))
for i in range(len(distance_matrix)):
for j in range(i):
distance_matrix[i][j] = euclidean_distance(X[i],X[j])
Secondly,
You can do your min search in the given matrix by hand if you don't like to use np tools or you are looking for a simple way.
min_value = np.inf
for i in range(len(distance_matrix)):
for j in range(i):
if( distance_matrix[i][j] < min_value):
min_value = distance_matrix[i][j]
min_i = i
min_j = j
Finally,
Update the distance matrix and merge clusters as fallows:
for i in range(len(distance_matrix)):
if( i > min_i and i < min_j ):
distance_matrix[i][min_i] = min(distance_matrix[i][min_i],distance_matrix[min_j][i])
elif( i > min_j ):
distance_matrix[i][min_i] = min(distance_matrix[i][min_i],distance_matrix[i][min_j])
for j in range(len(distance_matrix)):
if( j < min_i ):
distance_matrix[min_i][j] = min(distance_matrix[min_i][j],distance_matrix[min_j][j])
#remove one of the old clusters data from the distance matrix
distance_matrix = np.delete(distance_matrix, min_j, axis=1)
distance_matrix = np.delete(distance_matrix, min_j, axis=0)
A[min_i] = A[min_i] + A[min_j]
A.pop(min_j)
Upvotes: 1