How to assign new observations to cluster using distance matrix and kmedoids?

Question

I have a dataframe that holds the Word Mover's Distance between each pair of documents in my dataset. I am running k-medoids on this to generate clusters.

        1      2      3      4      5
  1  0.00   0.05   0.07   0.04   0.05
  2  0.05   0.00   0.06   0.04   0.05
  3  0.07   0.06   0.00   0.06   0.06
  4  0.04   0.04   0.06   0.00   0.04
  5  0.05   0.05   0.06   0.04   0.00

  kmed = KMedoids(n_clusters=3, random_state=123, method='pam').fit(distance)

After running on this initial matrix and generating clusters, I want to add new points to be clustered. After adding a new document to the distance matrix I end up with:

        1      2      3      4      5      6
  1  0.00   0.05   0.07   0.04   0.05   0.12
  2  0.05   0.00   0.06   0.04   0.05   0.21
  3  0.07   0.06   0.00   0.06   0.06   0.01
  4  0.04   0.04   0.06   0.00   0.04   0.05
  5  0.05   0.05   0.06   0.04   0.00   0.12
  6  0.12   0.21   0.01   0.05   0.12   0.00

I have tried using kmed.predict on the new row.

kmed.predict(new_distance.loc[-1: ])

However, this gives me an error of incompatible dimensions X.shape[1] == 6 while Y.shape[1] == 5.

How can I use this distance of the new document to determine which cluster it should be a part of? Is this even possible, or do I have to recompute clusters every time? Thanks!

David Dale · Accepted Answer

The source code for k-medoids says the following:

def transform(self, X):
    """Transforms X to cluster-distance space.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_query, n_features), \
            or (n_query, n_indexed) if metric == 'precomputed'
        Data to transform.
    """

I assume that you use the precomputed metric (because you compute the distances outside the classifier), so in your case n_query is the number of new documents, and n_indexed is the number of the documents for which the fit method was called.

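In your particular case, where you fit the model on 5 documents and then want to classify the 6th one, the X passed for classification should have shape (1, 5): the last row of the new matrix without its last column, which is the new document's distance to itself. Using positional indexing (.iloc rather than the label-based .loc), this can be computed as

kmed.predict(new_distance.iloc[-1:, :-1])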
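To see why only the distances to the already-fitted documents matter, here is a minimal sketch of nearest-medoid assignment in plain Python. It assumes that prediction with a precomputed metric comes down to picking the closest medoid. The medoid positions below are hypothetical, chosen only for illustration; with a fitted model they would presumably come from kmed.medoid_indices_. The helper name assign_to_nearest_medoid is likewise made up for this example.

```python
# Distances from the new document (6) to the five fitted documents:
# the last row of the extended matrix with the self-distance dropped.
new_row = [0.12, 0.21, 0.01, 0.05, 0.12]

# Hypothetical 0-based positions of the medoids among the fitted documents
# (an assumption for illustration, not the result of an actual fit).
medoid_indices = [0, 2, 3]


def assign_to_nearest_medoid(row, medoids):
    """Return the cluster label whose medoid is closest to the new point."""
    # Keep only the distances from the new point to each medoid.
    dists = [row[m] for m in medoids]
    # The label is the position of the smallest of those distances.
    return min(range(len(dists)), key=dists.__getitem__)


label = assign_to_nearest_medoid(new_row, medoid_indices)
print(label)  # → 1, the cluster whose medoid is document 3 (distance 0.01)
```

So there is no need to recompute the clusters for each new document, as long as you are happy to keep the existing medoids fixed.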
