Sklearn Agglomerative Clustering Custom Affinity

I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), since I'd like to cluster sequences of integers by sequence similarity rather than by something like Euclidean distance, which isn't meaningful here.

My data looks something like this

>>> dat.values
array([[860, 261, 240, ..., 300, 241,   1],
       [860, 840, 860, ..., 860, 240,   1],
       [260, 860, 260, ..., 260, 220,   1],
       ...,
       [260, 260, 260, ..., 260, 260,   1],
       [260, 860, 260, ..., 840, 860,   1],
       [280, 240, 241, ..., 240, 260,   1]])

I've created the following similarity function

def sim(x, y): 
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

This returns the fraction of matching values between the two sequences, computed with numpy. I then make the following call:

cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)

But I'm getting an error saying

TypeError: sim() missing 1 required positional argument: 'y'

I'm not sure why I'm getting this error; I thought the function would be called on pairs of rows, so both required arguments would be passed.

Any help with this would be greatly appreciated

Upvotes: 11

Views: 9097

Answers (2)

Vivek Kumar

Reputation: 36619

When 'affinity' is a callable, it receives a single input X (your feature or observation matrix) and must return the distances between all pairs of points (samples) in it.

So you need to modify your method as:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Your method to calculate distance between two samples
def sim(x, y): 
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

# Method to calculate distances between all sample pairs
def sim_affinity(X):
    return pairwise_distances(X, metric=sim)

X = dat.values  # the observation matrix from the question
cluster = AgglomerativeClustering(n_clusters=5, affinity=sim_affinity, linkage='average')
cluster.fit(X)
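
To make the callable's contract concrete, here is a minimal NumPy-only sketch: the function takes the whole matrix and returns an (n_samples, n_samples) array. The name mismatch_affinity and the toy data are made up for illustration, and the function returns the fraction of mismatching positions (a distance), not the similarity from the question.

```python
import numpy as np

def mismatch_affinity(X):
    """Hypothetical helper: pairwise fraction of positions where two rows differ."""
    X = np.asarray(X)
    # Broadcasting compares every row with every other row,
    # giving a boolean array of shape (n_samples, n_samples, n_features).
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)

# Toy data standing in for dat.values (assumed, not from the question)
X_toy = np.array([[860, 261, 240, 1],
                  [860, 261, 241, 1],
                  [260, 260, 260, 1]])

D = mismatch_affinity(X_toy)
print(D.shape)
print(D)
```

The broadcasted comparison needs memory proportional to n_samples * n_samples * n_features, so for large inputs a pairwise helper that works in chunks is probably the safer choice.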

Or you can use affinity='precomputed' as @avchauzov has suggested. For that you will have to pass the pre-calculated distance matrix for your observations in fit(). Something like:

cluster = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='average')
distance_matrix = sim_affinity(X)
cluster.fit(distance_matrix)
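
A small end-to-end sketch of the precomputed route, on made-up toy data. One assumption to flag: as far as I know, newer scikit-learn releases renamed the affinity parameter to metric (deprecated in 1.2, removed in 1.4), so the snippet picks whichever keyword the installed version accepts.

```python
import inspect
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy sequences (assumed data): rows 0-1 are alike, rows 2-3 are alike
X_demo = np.array([[860, 261, 240, 1],
                   [860, 261, 241, 1],
                   [260, 260, 260, 1],
                   [260, 260, 220, 1]])

# Distance = fraction of mismatching positions (1 - similarity)
dist = (X_demo[:, None, :] != X_demo[None, :, :]).mean(axis=2)

# Use "metric" if this scikit-learn version has it, otherwise "affinity"
params = inspect.signature(AgglomerativeClustering).parameters
key = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(n_clusters=2, linkage="average",
                                **{key: "precomputed"})
labels = model.fit_predict(dist)
print(labels)
```

Rows 0 and 1 should end up in one cluster and rows 2 and 3 in the other; the numeric label assigned to each cluster is arbitrary.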

Note: Your function returns a similarity, not a distance, whereas the clustering treats the affinity values as distances (smaller means closer). With a similarity, the least similar rows would be merged first. So tweak your function to return a distance instead. Something like:

def sim(x, y): 
    # Subtracted from 1.0 (highest similarity), so now it represents distance
    return 1.0 - np.sum(np.equal(np.array(x), np.array(y)))/len(x)
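
As a side note, this distance form (fraction of mismatching positions) coincides with the normalized Hamming distance, which pairwise_distances accepts as metric='hamming'. The sketch below, on made-up toy data and with the hypothetical name sim_distance, checks that the two agree; the built-in metric avoids a Python call per pair of rows.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def sim_distance(x, y):
    # 1 - fraction of matching positions, as in the tweaked sim() above
    return 1.0 - np.sum(np.equal(x, y)) / len(x)

# Toy data (assumed, not from the question)
X_small = np.array([[860, 261, 240, 1],
                    [860, 261, 241, 1],
                    [260, 260, 260, 1]])

d_custom = pairwise_distances(X_small, metric=sim_distance)
d_hamming = pairwise_distances(X_small, metric="hamming")
print(np.allclose(d_custom, d_hamming))
```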

Upvotes: 20

andrewchauzov

Reputation: 1019

The common way to do it is to set affinity='precomputed' and fit on the distance matrix (see example here: https://gist.github.com/codehacken/8b9316e025beeabb082dda4d0654a6fa)

UPD: In sklearn/cluster/hierarchical.py (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L460) you can see that a custom affinity receives only X (not y) as input, and its result is then used by linkage_tree. So you need to rewrite your sim() function.

But in my opinion the first way is much more convenient.

Upvotes: 2
