Reputation: 121
I'm trying to use fastclustering.py to cluster some textual data. My data is in a DataFrame column, df['processed_activities']. However, I get an error reporting a length mismatch between 17 (the number of generated clusters) and 25006 (the length of the DataFrame index).
Using the following code:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
import time
import numpy as np
import torch
# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load a pre-trained sentence transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Get the sentence embeddings for the activities column
sentences = df['processed_activities'].tolist()
embeddings = model.encode(sentences)
# Convert the embeddings numpy array to PyTorch tensor
embeddings = torch.from_numpy(embeddings).to(device)
print("Start clustering")
start_time = time.time()
# Two parameters to tune:
# min_community_size: only consider clusters that have at least 25 elements
# threshold: consider sentence pairs with a cosine similarity larger than threshold as similar
labels = util.community_detection(embeddings, min_community_size=25, threshold=0.75)
print("Clustering done after {:.2f} sec".format(time.time() - start_time))
# Add the cluster labels to the dataframe
df['cluster'] = labels
# Print the clusters
num_clusters = np.max(labels) + 1
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(df.loc[df['cluster'] == i]['processed_activities'].values)
Generated error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_23/3969732060.py in <module>
29
30 # Add the cluster labels to the dataframe
---> 31 df['cluster'] = labels
32
33 # Print the clusters
/opt/conda/lib/python3.7/site-packages/pandas/core/common.py in require_length_match(data, index)
530 if len(data) != len(index):
531 raise ValueError(
--> 532 "Length of values "
533 f"({len(data)}) "
534 "does not match length of index "
ValueError: Length of values (17) does not match length of index (25006)
Reputation: 538
Your DataFrame df has 25006 rows in it. You can check this by calling
print(len(df)) # prints 25006
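This means that any new column (here "cluster") must also have 25006 values. util.community_detection does not return one label per row, though: it returns a list of communities, each of which is a list of row positions. With 17 communities found, labels has length 17, and assigning it to the column raises the length mismatch.
Either keep the result in a separate variable instead of the source df, or, if you really want it in the df, first expand it into one label per row (for example, using -1 for rows that belong to no community).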
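One way to get the result into the df is to turn the list of communities into one label per row. The following is only a minimal sketch, assuming labels is the list of lists of row positions returned by util.community_detection. The helper name communities_to_labels and the -1 marker for unassigned rows are made up for illustration and are not part of sentence_transformers or pandas.

```python
def communities_to_labels(communities, n_rows, unassigned=-1):
    """Expand a list of communities into one cluster label per row.

    communities: list of lists of row positions (hypothetical input in the
                 same shape as the output of util.community_detection)
    n_rows:      total number of rows, e.g. len(df)
    unassigned:  label for rows that belong to no community
    """
    labels = [unassigned] * n_rows
    for cluster_id, members in enumerate(communities):
        for row in members:
            labels[row] = cluster_id
    return labels


# Toy input: 7 rows, 2 communities, rows 2 and 6 belong to neither
toy_communities = [[0, 3, 4], [1, 5]]
row_labels = communities_to_labels(toy_communities, n_rows=7)
print(row_labels)  # [0, 1, -1, 0, 0, 1, -1]
```

With the code from the question, this would presumably be used as df['cluster'] = communities_to_labels(labels, len(df)), which has the same length as the index. The number of clusters is then simply len(labels), and rows labelled -1 were not placed in any community of at least min_community_size elements.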