Reputation: 294
My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.
So far I have calculated the tf-idf of the documents by doing the following:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)
The problem I am having is this: now that I have the tf-idf of the documents, what operations do I perform on the query so that I can find its cosine similarity to the documents?
Upvotes: 11
Views: 13515
Reputation: 16966
Here is my suggestion: the text-cleaning function can be passed to TfidfVectorizer directly using the preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: fitted TfidfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc
    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
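To connect this back to the original question (3 queries against 5 documents), here is a minimal sketch that transforms all queries at once with the same fitted vectorizer. The toy documents and queries are made up for illustration, and ranking queries by their mean similarity over the documents is only one possible reading of "most similar to a set of documents", so treat that part as an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the 5 documents and 3 queries.
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
    "the garden is full of flowers",
    "investors worry about market prices",
]
queries = ["cat on a mat", "market prices today", "flowers in the garden"]

# Fit once on the documents, then reuse the same vocabulary for the queries.
vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(documents)  # shape: (5, n_terms)
queries_tfidf = vectorizer.transform(queries)     # shape: (3, n_terms)

# One row per query, one column per document.
similarities = cosine_similarity(queries_tfidf, docs_tfidf)  # shape: (3, 5)

# Index of the closest document for each query.
best_doc_per_query = similarities.argmax(axis=1)

# Assumption: score each query by its mean similarity over all documents.
mean_similarity_per_query = similarities.mean(axis=1)
best_query = int(mean_similarity_per_query.argmax())

print(similarities.round(3))
print("closest document per query:", best_doc_per_query)
print("query with highest mean similarity:", queries[best_query])
```

Words in a query that never occurred in the documents are simply ignored by transform, so a query with no known words gets a similarity of 0 against every document.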
Upvotes: 12
Reputation: 294
The other answers were very helpful but not entirely what I was looking for, as they didn't show how to transform my query so that I could compare it with the documents.
To transform the query, I first fit a vectorizer on the documents:
queryTFIDF = TfidfVectorizer().fit(allDocs)
I then use it to transform the query into a tf-idf vector with the same columns as the document matrix:
queryTFIDF = queryTFIDF.transform([query])
And then I calculate the cosine similarity between my query and all the documents using the sklearn.metrics.pairwise.cosine_similarity function:
cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
I realise that, using Nihal's solution, I could have input my query as one of the documents and then calculated the similarity between it and the other documents, but this is what worked best for me.
The full code ends up looking like:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_idf_query_similarity(documents, query):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    docTFIDF = TfidfVectorizer().fit_transform(allDocs)
    queryTFIDF = TfidfVectorizer().fit(allDocs)
    queryTFIDF = queryTFIDF.transform([query])
    cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
    return cosineSimilarities
Upvotes: 3
Reputation: 303
You can do as Nihal has written in his response, or you can use the nearest neighbors algorithm from sklearn. You have to select the proper metric (cosine):
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
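For completeness, a rough sketch of how the estimator above might be fitted and queried. The toy documents and query are hypothetical, and n_neighbors is set to the number of documents only so that every document is ranked. With metric='cosine' the estimator returns cosine distances, which equal 1 minus the cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data for illustration.
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
    "the garden is full of flowers",
    "investors worry about market prices",
]
query = "market prices today"

vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(documents)

# n_neighbors must not exceed the number of fitted documents.
neigh = NearestNeighbors(n_neighbors=len(documents), metric="cosine")
neigh.fit(docs_tfidf)

# The query must be transformed with the same fitted vectorizer.
query_tfidf = vectorizer.transform([query])
distances, indices = neigh.kneighbors(query_tfidf)

# Convert cosine distance back to cosine similarity.
similarities = 1.0 - distances[0]

for doc_index, score in zip(indices[0], similarities):
    print(f"{score:.3f}  {documents[doc_index]}")
```

Results come back sorted from nearest to farthest, so the first index is the most similar document.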
Upvotes: 3
Reputation: 5515
Cosine similarity is the cosine of the angle between the vectors that represent the documents.
K(X, Y) = <X, Y> / (||X||*||Y||)
Your tf-idf matrix will be a sparse matrix of shape (number of documents) x (number of distinct words).
To print the whole matrix you can use todense():
print(tfidf.todense())
Each row is the vector representation of one document. Likewise, each column holds the tf-idf scores of one unique word in the corpus.
The pairwise similarity between a reference vector and every row of your tf-idf matrix can be calculated as:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(reference_vector, tfidf_matrix)
The output will be an array with one similarity score per document, comparing your reference vector with the vector of each document. If the reference vector is itself a row of the matrix, its similarity with itself will be 1. Because tf-idf weights are non-negative, every score will be a value between 0 and 1.
To find the similarity between the first and second documents:
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))
array([[0.36651513]])
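As a sanity check on the formula K(X, Y) above, the following sketch computes the similarity between two rows by hand and compares it with cosine_similarity. The toy documents are made up for illustration. It also relies on the fact that TfidfVectorizer uses norm='l2' by default, so each row already has unit length and the plain dot product gives the same number.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus for illustration.
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
]
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Dense 1-D vectors for the first and second documents.
x = tfidf_matrix[0].toarray().ravel()
y = tfidf_matrix[1].toarray().ravel()

# K(X, Y) = <X, Y> / (||X|| * ||Y||), computed by hand.
manual = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# The same quantity from scikit-learn.
library = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0, 0]

# Rows are L2-normalised by default, so the dot product alone matches.
dot_only = x.dot(y)

print(manual, library, dot_only)
```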
Upvotes: 1