Reputation: 31
I have a very large dataset which is essentially document–search-query pairs, and I want to calculate the similarity for each pair. I've calculated the TF-IDF for each of the documents and queries. I realize that, given two vectors, you can calculate the similarity using linear_kernel. However, I'm not sure how to do this on a very large set of data (i.e. without for loops).
Here is what I have so far:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df_train = pd.read_csv('train.csv')
vectorizer = TfidfVectorizer()
# Fit the vocabulary on the documents, then reuse it for the queries
doc_tfidf = vectorizer.fit_transform(df_train["document"])
query_tfidf = vectorizer.transform(df_train["query"])
linear_kernel(doc_tfidf, query_tfidf)
Now this gives me an NxN matrix, where N is the number of document-query pairs I have. What I am looking for is a vector of length N, with a single value per document-query pair.
I realize I could do this with a for loop, but with a dataset of about 500K pairs that would not be feasible. Is there some way that I could vectorize this calculation?
UPDATE: So I think I have a solution that works and seems to be fast. In the code above I replace:
linear_kernel(doc_tfidf, query_tfidf)
with
df_train['similarity'] = doc_tfidf.multiply(query_tfidf).sum(axis=1)
Does this seem like a sane approach? Is there a better way to do this?
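For reference, here is a minimal self-contained sketch of that approach on a couple of made-up pairs (the example strings are invented; the steps mirror the code above):
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Toy document-query pairs standing in for train.csv
df = pd.DataFrame({
    "document": ["the cat sat on the mat", "dogs chase cats"],
    "query": ["cat on mat", "chasing dogs"],
})
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(df["document"])
query_tfidf = vectorizer.transform(df["query"])
# Element-wise multiply matching rows, then sum over the vocabulary axis:
# one similarity value per document-query pair, no Python-level loop.
df["similarity"] = np.asarray(doc_tfidf.multiply(query_tfidf).sum(axis=1)).ravel()
print(df["similarity"])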
Upvotes: 3
Views: 1209
Reputation: 11201
Cosine similarity is typically used to compute the similarity between text documents, which in scikit-learn is implemented in sklearn.metrics.pairwise.cosine_similarity.
However, because TfidfVectorizer also performs an L2 normalization of the results by default (i.e. norm='l2'), in this case it is sufficient to compute the dot product to get the cosine similarity.
In your example, you should therefore use
similarity = doc_tfidf.dot(query_tfidf.T).T
instead of an element-wise multiplication.
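As a quick sanity check, here is a small sketch on invented strings showing that this dot product of the (L2-normalized by default) TF-IDF vectors matches cosine_similarity, and that the one-value-per-pair result you want is simply the diagonal of that matrix:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
docs = ["the cat sat on the mat", "dogs chase cats"]      # made-up examples
queries = ["cat on mat", "chasing dogs"]
vectorizer = TfidfVectorizer()                            # norm='l2' is the default
doc_tfidf = vectorizer.fit_transform(docs)
query_tfidf = vectorizer.transform(queries)
# Pairwise matrix: entry [i, j] is the similarity of query i and document j
similarity = doc_tfidf.dot(query_tfidf.T).T.toarray()
print(np.allclose(similarity, cosine_similarity(query_tfidf, doc_tfidf)))  # True
# One value per document-query pair: the diagonal of the pairwise matrix,
# which equals the row-wise multiply-and-sum from the question.
print(similarity.diagonal())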
Upvotes: 1