Reputation: 31
I have a very large dataset which is essentially document–search-query pairs, and I want to calculate the similarity for each pair. I've calculated the TF-IDF for each of the documents and queries. I realize that, given two vectors, you can calculate the similarity using linear_kernel. However, I'm not sure how to do this on a very large set of data (i.e. without for loops).
Here is what I have so far:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df_train = pd.read_csv('train.csv')
vectorizer = TfidfVectorizer()
# Fit the vocabulary on the documents, then reuse it for the queries
doc_tfidf = vectorizer.fit_transform(df_train["document"])
query_tfidf = vectorizer.transform(df_train["query"])
linear_kernel(doc_tfidf, query_tfidf)
Now this gives me an NxN matrix, where N is the number of document-query pairs I have. What I am looking for is a vector of length N, with a single value per document-query pair.
I realize I could do this with a for loop, but with a dataset of about 500K pairs that would not be feasible. Is there some way that I could vectorize this calculation?
UPDATE: So I think I have a solution that works and seems to be fast. In the code above I replace:
linear_kernel(doc_tfidf, query_tfidf)
with
df_train['similarity'] = doc_tfidf.multiply(query_tfidf).sum(axis=1)
Does this seem like a sane approach? Is there a better way to do this?
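For reference, here is a minimal self-contained sketch of that approach on a couple of made-up pairs (the example strings are invented; the steps mirror the code above):
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Toy document-query pairs standing in for train.csv
df = pd.DataFrame({
    "document": ["the cat sat on the mat", "dogs chase cats"],
    "query": ["cat on mat", "chasing dogs"],
})
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(df["document"])
query_tfidf = vectorizer.transform(df["query"])
# Element-wise multiply matching rows, then sum over the vocabulary axis:
# one similarity value per document-query pair, no Python-level loop.
df["similarity"] = np.asarray(doc_tfidf.multiply(query_tfidf).sum(axis=1)).ravel()
print(df["similarity"])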
Upvotes: 3
Views: 1209
Reputation: 11201
Cosine similarity is typically used to compute the similarity between text documents, which in scikit-learn is implemented in sklearn.metrics.pairwise.cosine_similarity.
However, because TfidfVectorizer also performs an L2 normalization of the results by default (i.e. norm='l2'), in this case it is sufficient to compute the dot product to get the cosine similarity.
In your example, you should therefore use
similarity = doc_tfidf.dot(query_tfidf.T).T
instead of an element-wise multiplication.
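As a quick sanity check, here is a small sketch on invented strings showing that this dot product of the (L2-normalized by default) TF-IDF vectors matches cosine_similarity, and that the one-value-per-pair result you want is simply the diagonal of that matrix:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
docs = ["the cat sat on the mat", "dogs chase cats"]      # made-up examples
queries = ["cat on mat", "chasing dogs"]
vectorizer = TfidfVectorizer()                            # norm='l2' is the default
doc_tfidf = vectorizer.fit_transform(docs)
query_tfidf = vectorizer.transform(queries)
# Pairwise matrix: entry [i, j] is the similarity of query i and document j
similarity = doc_tfidf.dot(query_tfidf.T).T.toarray()
print(np.allclose(similarity, cosine_similarity(query_tfidf, doc_tfidf)))  # True
# One value per document-query pair: the diagonal of the pairwise matrix,
# which equals the row-wise multiply-and-sum from the question.
print(similarity.diagonal())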
Upvotes: 1