How can I improve my Shared Nearest Neighbor clustering algorithm?

Question

I wrote my own Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I find the nearest neighbors of each data point, precompute a distance matrix using the Jaccard distance between neighbor lists, and pass that matrix to DBSCAN.

To speed things up, I compute the Jaccard distance between two data points only if they are nearest neighbors of each other and share more than a certain number of neighbors (k_min in the code below). I also exploit the symmetry of the distance matrix by computing only half of it.
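For reference, the quantity described above can be written out as follows. The notation is an assumption made for illustration and is not taken from the post: N(a) and N(b) denote the neighbor sets of points a and b, and k is the number of neighbors per point.

```latex
d_J(a, b) = 1 - \frac{|N(a) \cap N(b)|}{|N(a) \cup N(b)|},
\qquad
|N(a) \cup N(b)| = 2k - |N(a) \cap N(b)|
\quad \text{when } |N(a)| = |N(b)| = k .
```

For example, with N(a) = {1, 2, 3, 4} and N(b) = {3, 4, 5, 6}, the two points share 2 neighbors out of 6 distinct ones, so the distance is 1 - 2/6 = 2/3.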

However, my implementation is slow and takes much longer than common clustering algorithms such as K-Means or plain DBSCAN. Can someone look at my code and suggest how to make it faster?

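import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays of neighbor indices.

    Parameters
    ----------
    a : array-like of int
        Neighbor indices of the first point; negative entries are ignored.
    b : array-like of int
        Neighbor indices of the second point; negative entries are ignored.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    # Drop negative entries before comparing the two index sets.
    A = A[A > -1]
    B = B[B > -1]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)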

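def iterator_dist(indices, k_min=5):
    """
    A generator that yields (n, m, dist) for each pair of stars n < m where
    m is a nearest neighbor of n and the two share more than k_min neighbors.

    Parameters
    ----------
    indices : array of shape (n_samples, n_neighbors)
        The indices of nearest neighbors in the chemistry-velocity space.
    k_min : int
        Threshold on the number of shared neighbors.
    """
    for n in range(len(indices)):
        # Only visit neighbors m > n, so each pair is handled once.
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)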

# load data here
data = 
# hyperparameters
n_neighbors = 
eps = 
min_samples = 
k_min = 
# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()
# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n,m] = dist
    S[m,n] = dist
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed', 
     n_jobs=-1).fit(S)
labels = db.labels_
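
One possible way to avoid the Python-level double loop above is to count shared neighbors with a sparse matrix product. The following is a rough sketch rather than a drop-in replacement: the function name `snn_jaccard_sparse` is hypothetical, and it assumes every row of `indices` holds k distinct neighbor indices with no -1 padding, so that the size of the union is 2k minus the number of shared neighbors.

```python
import numpy as np
from scipy.sparse import csr_matrix, triu

def snn_jaccard_sparse(indices, k_min=5):
    """Sparse Jaccard distances between neighbor lists (hypothetical helper).

    Assumes each row of `indices` contains k distinct, non-negative indices.
    """
    indices = np.asarray(indices)
    n, k = indices.shape
    # Binary kNN adjacency: A[i, j] = 1 if j is one of i's neighbors.
    rows = np.repeat(np.arange(n), k)
    A = csr_matrix((np.ones(n * k), (rows, indices.ravel())), shape=(n, n))
    # (A @ A.T)[i, j] counts the neighbors shared by i and j. Keep only
    # pairs i < j where j is a neighbor of i, as in the loop version.
    mask = triu(A, k=1, format="csr")
    shared = (A @ A.T).multiply(mask).tocoo()
    keep = shared.data > k_min
    r, c, s = shared.row[keep], shared.col[keep], shared.data[keep]
    # With k distinct neighbors per row, |union| = 2k - |intersection|.
    dist = 1.0 - s / (2.0 * k - s)
    upper = csr_matrix((dist, (r, c)), shape=(n, n))
    # Mirror the upper triangle to obtain a symmetric matrix.
    return upper + upper.T
```

Under those assumptions, the returned matrix could be passed to DBSCAN with metric='precomputed' in place of S. Whether this is actually faster depends on the data size and the number of neighbors, so it would need to be timed on the real data.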
