Minu

Reputation: 480

Filtering cosine similarity scores into a pandas dataframe

I'm trying to calculate cosine similarity scores between all possible pairs of text documents in a corpus, using scikit-learn's cosine_similarity function. Since my corpus is huge (30 million documents), there are far too many document pairs to store as a dataframe. So I'd like to filter the similarity scores with a threshold as they're being created, before storing them in a dataframe for future use. While doing that, I also want to assign the corresponding document IDs to the index and column names of the dataframe, so that each value in the dataframe has a row label and a column label that are the IDs of the two documents whose cosine similarity it holds.

similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix), index = IDs, columns= IDs)

This piece of code works well without the filtering part. IDs is a list containing all document IDs, ordered to match the rows of the tfidf matrix.

similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix)>0.65, index = IDs, columns= IDs)

This modification helps with the filtering, but the similarity scores are turned into boolean (True/False) values. How can I keep the actual cosine similarity scores instead of the boolean True/False values?
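
To make the issue concrete, here's a toy illustration (the three-document corpus and the IDs list below are made up for the example); the > comparison in the second print is what turns the scores into booleans:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# toy stand-ins for the real corpus and ID list
corpus = ["the cat sat", "the cat sat down", "an unrelated sentence"]
IDs = ["doc_a", "doc_b", "doc_c"]

tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

scores = pd.DataFrame(cosine_similarity(tfidf_matrix), index=IDs, columns=IDs)
print(scores)          # actual cosine similarity scores
print(scores > 0.65)   # comparison yields True/False, the scores are lost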

Upvotes: 3

Views: 1959

Answers (1)

Ahmed Elsafty

Reputation: 579

We can break the cosine similarity computation into batches. For example, you're using cosine_similarity(tfidf_matrix) to generate an NxN matrix, but we can also use cosine_similarity(tfidf_matrix[:m], tfidf_matrix) to generate an mxN matrix, and then combine all the mxN matrices to construct the final NxN matrix. Based on your followup clarification, we can do the following:

# imports needed to run the snippet.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# source: followup clarification #2
def question_followup_transformer(df):
  return df.stack().reset_index().rename(columns={'level_0': 'ID1', 'level_1': 'ID2', 0: 'Score'})

# corpus is not provided in the example. 
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

matrix_length = tfidf_matrix.shape[0]

BATCH_SIZE = 10
FILTER_THRESHOLD = 0.6

dfs = []
# iterate over the rows in chunks of BATCH_SIZE (the last chunk may be smaller).
for i in range(0, matrix_length, BATCH_SIZE):
  batch_end = min(i + BATCH_SIZE, matrix_length)

  # compute cosine similarity for one batch of rows against the full matrix.
  subMatrix = cosine_similarity(tfidf_matrix[i:batch_end], tfidf_matrix)

  # set the proper row/column positions of the submatrix in a dataframe.
  similarity_values = pd.DataFrame(
      subMatrix,
      index = range(i, batch_end),
      columns = range(0, matrix_length))
  
  # apply the stack transformation from the followup clarification.
  stacked_df = question_followup_transformer(similarity_values)

  # keep only the scores above the filter threshold.
  filtered_df = stacked_df.query("Score > {}".format(FILTER_THRESHOLD))

  # append the filtered dataframe to a list.
  dfs.append(filtered_df)

# concat all dataframes into a single one.
df = pd.concat(dfs, ignore_index=True)
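
Since you want document IDs rather than positional indices in the result, one option (a sketch, assuming IDs is the list from your question, aligned with the rows of tfidf_matrix) is to translate the positions after the concat:

# map positional indices back to document IDs; IDs is assumed to be the
# list from the question, aligned with the rows of tfidf_matrix.
df['ID1'] = df['ID1'].map(lambda pos: IDs[pos])
df['ID2'] = df['ID2'].map(lambda pos: IDs[pos])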

Upvotes: 0
