Reputation: 365
I followed Fred Foo's explanation in this Stack Overflow question: How to compute the similarity between two text documents?
I have run the following piece of code that he wrote:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())
And the result is:
[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]
But what I noticed is that when I set corpus to be:
corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away"]
and run the same code again, I get the matrix:
[[1.         0.19431434]
 [0.19431434 1.        ]]
So the similarity between these two sentences changes: it is 0.19431434 here, but it was 0.17668795 in the first matrix. Why is that the case? I am really confused. Thank you in advance!
Reputation: 143057
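The Wikipedia article on tf-idf shows how the weights are calculated. The idf part uses N, the number of documents in the corpus, together with the number of documents that contain each term.
So the similarity between two documents depends on all the documents/sentences in the corpus, not only on the two you compare.
If you add more documents/sentences, the results change.
If you add the same document/sentence several times, the results change too.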
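To make the dependence on N concrete, here is a rough hand calculation in plain Python. It is only a sketch under some assumptions: it uses the smoothed idf formula that the scikit-learn documentation gives for the default settings (smooth_idf=True, L2 normalization), and it assumes that after stop-word removal the first sentence keeps "like" and "apple" while the second keeps "apple" plus four terms that occur in no other document. The helper names smoothed_idf and apple_similarity are made up for this illustration.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    # Smoothed idf as documented for scikit-learn's defaults (smooth_idf=True):
    # ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def apple_similarity(n_docs, df_apple):
    # Assumption: the two sentences share only "apple"; every other term
    # they keep appears in exactly one document of the corpus (df = 1).
    idf_apple = smoothed_idf(n_docs, df_apple)
    idf_rare = smoothed_idf(n_docs, 1)
    # Sentence 1: "like", "apple". Sentence 2: "apple" plus four rare terms.
    # Each term occurs once per sentence, so tf = 1 and the weight is the idf.
    norm1 = math.sqrt(idf_apple ** 2 + idf_rare ** 2)
    norm2 = math.sqrt(idf_apple ** 2 + 4 * idf_rare ** 2)
    # Cosine similarity: only the shared term "apple" contributes.
    return (idf_apple / norm1) * (idf_apple / norm2)


# Two-sentence corpus: N = 2, "apple" appears in both documents.
sim_two_docs = apple_similarity(n_docs=2, df_apple=2)
# Five-sentence corpus: N = 5, "apple" appears in three documents.
sim_five_docs = apple_similarity(n_docs=5, df_apple=3)

print(round(sim_two_docs, 4), round(sim_five_docs, 4))
```

The two values should come out close to the 0.1943 and 0.1767 reported in the question. The sentences themselves are unchanged; only N and the document frequencies differ, which changes the idf weights and therefore the cosine similarity.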