geo

Reputation: 31

python - How do I calculate the similarity between pairs of documents and queries?

I have a very large dataset which is essentially pairs of documents and search queries, and I want to calculate the similarity for each pair. I've calculated the TF-IDF for each of the documents and queries. I realize that given two vectors you can calculate the similarity using linear_kernel. However, I'm not sure how to do this on a very large set of data (i.e. without for loops).

Here is what I have so far:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

df_train = pd.read_csv('train.csv')

# Fit the vocabulary on the documents, then project the queries into it
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(df_train["document"])
query_tfidf = vectorizer.transform(df_train["query"])

linear_kernel(doc_tfidf, query_tfidf)

Now this gives me an NxN matrix, where N is the number of document-query pairs I have. What I am looking for is an N-sized vector with a single value per document-query pair.

I realize I could do this with a for loop, but with a dataset of about 500K pairs this would not work. Is there some way that I could vectorize this calculation?

UPDATE: So I think I have a solution that works and seems to be fast. In the code above I replace:

linear_kernel(doc_tfidf, query_tfidf)

with

df_train['similarity'] = np.asarray(doc_tfidf.multiply(query_tfidf).sum(axis=1)).ravel()

Does this seem like a sane approach? Is there a better way to do this?
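As a minimal sanity check of this row-wise approach on toy data (the strings below are hypothetical stand-ins for the real train.csv columns):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the "document" and "query" columns
docs = ["the cat sat on the mat", "dogs chase cats", "a query about mats"]
queries = ["cat on mat", "dog chasing", "mat query"]

vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(docs)
query_tfidf = vectorizer.transform(queries)

# Element-wise multiply, then sum each row: one dot product per pair.
# The sparse .sum(axis=1) returns an (N, 1) np.matrix, so flatten it.
similarity = np.asarray(doc_tfidf.multiply(query_tfidf).sum(axis=1)).ravel()

# One value per pair, not an N x N matrix
assert similarity.shape == (3,)

# Agrees with an explicit per-pair loop
expected = np.array([doc_tfidf[i].multiply(query_tfidf[i]).sum()
                     for i in range(3)])
assert np.allclose(similarity, expected)
```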

Upvotes: 3

Views: 1209

Answers (1)

rth

Reputation: 11201

Cosine similarity is typically used to compute the similarity between text documents, which in scikit-learn is implemented in sklearn.metrics.pairwise.cosine_similarity.

However, because TfidfVectorizer also performs an L2 normalization of the results by default (i.e. norm='l2'), in this case it is sufficient to compute the dot product to get the cosine similarity.

In your example, you should therefore use

similarity = doc_tfidf.dot(query_tfidf.T).T

instead of an element-wise multiplication.
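A minimal sketch of this equivalence on toy data (the example strings are made up): with the default norm='l2', every TF-IDF row is a unit vector, so the plain dot product matches cosine_similarity exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["apples and oranges", "oranges are orange", "apples are red"]
queries = ["red apples", "orange juice"]

# norm='l2' is the default: each row of the TF-IDF matrix has unit length
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(docs)
query_tfidf = vectorizer.transform(queries)

# Dot product of unit vectors is exactly the cosine similarity
dot = query_tfidf.dot(doc_tfidf.T).toarray()
cos = cosine_similarity(query_tfidf, doc_tfidf)
assert np.allclose(dot, cos)
```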

Upvotes: 1
