Minu

Reputation: 480

Filtering cosine similarity scores into a pandas dataframe

I'm trying to calculate cosine similarity scores between all possible pairs of text documents in a corpus, using scikit-learn's cosine_similarity function. Since my corpus is huge (30 million documents), there are far too many document pairs to store as a dataframe. So I'd like to filter the similarity scores with a threshold as they're being created, before storing them in a dataframe for future use. While doing that, I also want to assign the corresponding document IDs to the index and column names of the dataframe, so that each value in the dataframe has a row label and a column label that are the IDs of the two documents whose cosine similarity it holds.

similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix), index = IDs, columns= IDs)

This piece of code works well without the filtering part. IDs is a list containing all document IDs, ordered to match the rows of the tfidf matrix.

similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix)>0.65, index = IDs, columns= IDs)

This modification helps with the filtering, but the similarity scores are turned into boolean (True/False) values. How can I keep the actual cosine similarity scores instead of the boolean True/False values?
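
To make the issue concrete, here's a toy illustration (the three-document corpus and the IDs list below are made up for the example); the > comparison in the second print is what turns the scores into booleans:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# toy stand-ins for the real corpus and ID list
corpus = ["the cat sat", "the cat sat down", "an unrelated sentence"]
IDs = ["doc_a", "doc_b", "doc_c"]

tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

scores = pd.DataFrame(cosine_similarity(tfidf_matrix), index=IDs, columns=IDs)
print(scores)          # actual cosine similarity scores
print(scores > 0.65)   # comparison yields True/False, the scores are lost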

Upvotes: 3

Views: 1959

Answers (1)

Ahmed Elsafty

Reputation: 579

We can break the cosine similarity computation into batches. For example, you're using cosine_similarity(tfidf_matrix) to generate an NxN matrix, but we can also use cosine_similarity(tfidf_matrix[:m], tfidf_matrix) to generate an mxN matrix, and then combine all the mxN matrices to construct the final NxN matrix. Based on your followup clarification, we can do the following:

# imports needed to run the snippet.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# source: followup clarification #2
def question_followup_transformer(df):
  return df.stack().reset_index().rename(columns={'level_0': 'ID1', 'level_1': 'ID2', 0: 'Score'})

# corpus is not provided in the example. 
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

matrix_length = tfidf_matrix.shape[0]

BATCH_SIZE = 10
FILTER_THRESHOLD = 0.6

dfs = []
# iterate over the rows in chunks of BATCH_SIZE (the last chunk may be smaller).
for i in range(0, matrix_length, BATCH_SIZE):
  batch_end = min(i + BATCH_SIZE, matrix_length)

  # compute cosine similarity for one batch of rows against the full matrix.
  subMatrix = cosine_similarity(tfidf_matrix[i:batch_end], tfidf_matrix)

  # set the proper row/column positions of the submatrix in a dataframe.
  similarity_values = pd.DataFrame(
      subMatrix,
      index = range(i, batch_end),
      columns = range(0, matrix_length))
  
  # apply the stack transformation from the followup clarification.
  stacked_df = question_followup_transformer(similarity_values)

  # keep only the scores above the filter threshold.
  filtered_df = stacked_df.query("Score > {}".format(FILTER_THRESHOLD))

  # append the filtered dataframe to a list.
  dfs.append(filtered_df)

# concat all dataframes into a single one.
df = pd.concat(dfs, ignore_index=True)
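
Since you want document IDs rather than positional indices in the result, one option (a sketch, assuming IDs is the list from your question, aligned with the rows of tfidf_matrix) is to translate the positions after the concat:

# map positional indices back to document IDs; IDs is assumed to be the
# list from the question, aligned with the rows of tfidf_matrix.
df['ID1'] = df['ID1'].map(lambda pos: IDs[pos])
df['ID2'] = df['ID2'].map(lambda pos: IDs[pos])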

Upvotes: 0
