Moe_blg

Reputation: 121

Clustering generates length mismatch with original data

I'm trying to use this fastclustering.py to cluster some textual data. My data is in a DataFrame column, df['processed_activities']. But I'm getting an error that reports a length mismatch between 17 (the number of generated clusters) and 25006 (the number of rows).

Using the following code:

from sentence_transformers import SentenceTransformer, util
import pandas as pd
import time
import numpy as np
import torch

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pre-trained sentence transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Get the sentence embeddings for the activities column
sentences = df['processed_activities'].tolist()
embeddings = model.encode(sentences)

# Convert the embeddings numpy array to PyTorch tensor
embeddings = torch.from_numpy(embeddings).to(device)

print("Start clustering")
start_time = time.time()

# Two parameters to tune:
# min_community_size: only consider communities that have at least 25 elements
# threshold: consider sentence pairs with a cosine similarity larger than threshold as similar
labels = util.community_detection(embeddings, min_community_size=25, threshold=0.75)

print("Clustering done after {:.2f} sec".format(time.time() - start_time))

# Add the cluster labels to the dataframe
df['cluster'] = labels

# Print the clusters
num_clusters = np.max(labels) + 1
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(df.loc[df['cluster'] == i]['processed_activities'].values)

Generated error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_23/3969732060.py in <module>
     29 
     30 # Add the cluster labels to the dataframe
---> 31 df['cluster'] = labels
     32 
     33 # Print the clusters


/opt/conda/lib/python3.7/site-packages/pandas/core/common.py in require_length_match(data, index)
    530     if len(data) != len(index):
    531         raise ValueError(
--> 532             "Length of values "
    533             f"({len(data)}) "
    534             "does not match length of index "

ValueError: Length of values (17) does not match length of index (25006)

Upvotes: 0

Views: 119

Answers (1)

Quantum

Reputation: 538

Your DataFrame df has 25006 rows in it. You can check this by calling

print(len(df)) # prints 25006

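This means that any new column (here "cluster") must also have 25006 values. util.community_detection does not return one label per row. It returns a list of communities, each of which is a list of row indices, so labels has only 17 entries (one per community) and the assignment fails with a length mismatch.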
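The mismatch can be reproduced on a small scale. The sketch below uses toy data: toy_df and toy_communities are hypothetical stand-ins for your df and for the list of communities (lists of row positions) that util.community_detection returns.

```python
import pandas as pd

# Hypothetical stand-ins: six rows, two communities given as lists of row positions
toy_df = pd.DataFrame({"processed_activities": ["a", "b", "c", "d", "e", "f"]})
toy_communities = [[0, 2, 5], [1, 4]]

# Six rows in the frame, but only two entries in the clustering result
print(len(toy_df), len(toy_communities))

try:
    # Same pattern as df['cluster'] = labels in the question
    toy_df["cluster"] = toy_communities
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

Comparing len(df) with len(labels) before the assignment is a quick way to spot this kind of problem.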

Either keep the result in its own variable instead of assigning it to df, or, if you really want it in the df, first expand it into one label per row (for example, -1 for rows that belong to no community) so that its length matches the index.
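One way to get the result into the df is to expand the communities into one label per row. This is a minimal sketch on toy data, not the library's own API: communities_to_labels is a hypothetical helper, toy_communities stands in for the output of util.community_detection, and -1 is an arbitrary marker for rows that end up in no community.

```python
import pandas as pd

def communities_to_labels(communities, n_rows, unassigned=-1):
    """Turn a list of communities (lists of row positions) into one label per row."""
    labels = [unassigned] * n_rows
    for cluster_id, members in enumerate(communities):
        for row_pos in members:
            labels[row_pos] = cluster_id
    return labels

# Hypothetical stand-ins: six rows, two communities, row 3 belongs to none
toy_df = pd.DataFrame({"processed_activities": ["a", "b", "c", "d", "e", "f"]})
toy_communities = [[0, 2, 5], [1, 4]]

# The label list now has exactly one entry per row, so the assignment succeeds
toy_df["cluster"] = communities_to_labels(toy_communities, len(toy_df))

# Print each cluster by iterating over the communities instead of using np.max
for cluster_id in range(len(toy_communities)):
    print(f"Cluster {cluster_id}:")
    print(toy_df.loc[toy_df["cluster"] == cluster_id, "processed_activities"].values)
```

With your data the call would be communities_to_labels(labels, len(df)). The positions in each community refer to the order of the sentences list, which matches the row order of df because the list was built with tolist().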

Upvotes: 0
