Reputation: 680
I have a dataframe that holds the Word Mover's Distance between each document in my dataframe. I am running kmediods on this to generate clusters.
1 2 3 4 5
1 0.00 0.05 0.07 0.04 0.05
2 0.05 0.00 0.06 0.04 0.05
3. 0.07 0.06 0.00 0.06 0.06
4 0.04 0.04. 0.06 0.00 0.04
5 0.05 0.05 0.06 0.04 0.00
kmed = KMedoids(n_clusters= 3, random_state=123, method ='pam').fit(distance)
After running on this initial matrix and generating clusters, I want to add new points to be clustered. After adding a new document to the distance matrix I end up with:
1 2 3 4 5 6
1 0.00 0.05 0.07 0.04 0.05 0.12
2 0.05 0.00 0.06 0.04 0.05 0.21
3. 0.07 0.06 0.00 0.06 0.06 0.01
4 0.04 0.04. 0.06 0.00 0.04 0.05
5 0.05 0.05 0.06 0.04 0.00 0.12
6. 0.12 0.21 0.01 0.05 0.12 0.00
I have tried using kmed.predict on the new row.
kmed.predict(new_distance.loc[-1: ])
However, this gives me an error of incompatible dimensions X.shape[1] == 6
while Y.shape[1] == 5
.
How can I use this distance of the new document to determine which cluster it should be a part of? Is this even possible, or do I have to recompute clusters every time? Thanks!
Upvotes: 4
Views: 614
Reputation: 1402
this is my trial, we must recompute the distance between the new point and the old ones each time.
import pandas as pd
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances
import numpy as np
# dummy data for trial
df = pd.DataFrame({0: [0,1],1 : [1,2]})
# calculatie distance
distance = pairwise_distances(df.values, df.values)
# fit model
kmed = KMedoids(n_clusters=2, random_state=123, method='pam').fit(distance)
new_point = [2,3]
distance = pairwise_distances(np.array(new_point).reshape(1, -1), df.values)
#calculate the distance between the new point and the initial dataset
print(distance)
#get ride of the last element which is the ditance of the new point with itself
print(kmed.predict(distance[0][:2].reshape(1, -1)))
Upvotes: 0
Reputation: 11424
The source code for k-medoids says the following:
def transform(self, X):
"""Transforms X to cluster-distance space.
Parameters
----------
X : {array-like, sparse matrix}, shape (n_query, n_features), \
or (n_query, n_indexed) if metric == 'precomputed'
Data to transform.
"""
I assume that you use the precomputed
metric (because you compute the distances outside the classifier), so in your case n_query
is the number of new documents, and n_indexed
is the number of the documents for which the fit
method was called.
In your particular case when you fit the model on 5 documents and then want to classify the 6'th one, the X
for classification should have shape (1,5)
, that can be computed as
kmed.predict(new_distance.loc[-1: , :-1])
Upvotes: 1