Text (cosine) similarity

Question

I followed Fred Foo's explanation in this Stack Overflow question: How to compute the similarity between two text documents?

I have run the following piece of code that he wrote:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())

And the result is:

[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]

But what I noticed is that when I set corpus to be:

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away"]

and run the same code again, I get the matrix:

[[1.         0.19431434]
 [0.19431434 1.        ]]

So the similarity between the same two sentences changes (in the first matrix it is 0.17668795). Why is that the case? I am really confused. Thank you in advance!

furas · Accepted Answer

The Wikipedia article on tf-idf shows how the weights are calculated.

The idf part of the formula uses N, the number of documents in the corpus.
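
For reference, here is a sketch of the two idf definitions that matter here. The first is the textbook form; the second is the smoothed form that scikit-learn documents as the TfidfVectorizer default (smooth_idf=True). The symbol df(t), the number of documents containing term t, is notation introduced for this sketch.

```latex
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
\qquad\text{(textbook form)}

\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
\qquad\text{(scikit-learn default, smooth\_idf=True)}
```

In both forms N appears directly, so a term's weight changes whenever documents are added to or removed from the corpus.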

So the similarity depends on the total number of documents/sentences in the corpus, not only on the two sentences being compared.
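
One way to see this is to look at the idf values the vectorizer learns. The sketch below reuses the corpora from the question; the helper idf_of is a hypothetical name introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

small = ["I'd like an apple",
         "An apple a day keeps the doctor away"]
large = small + ["Never compare an apple to an orange",
                 "I prefer scikit-learn to Orange",
                 "The scikit-learn docs are Orange and Blue"]


def idf_of(term, corpus):
    """Fit a fresh vectorizer on `corpus` and return the learned idf of `term`."""
    vect = TfidfVectorizer(stop_words="english").fit(corpus)
    return vect.idf_[vect.vocabulary_[term]]


# "apple" occurs in every document of the small corpus,
# but only in 3 of the 5 documents of the large one.
idf_small = idf_of("apple", small)
idf_large = idf_of("apple", large)
print(idf_small, idf_large)
```

The word "apple" gets a different weight in each corpus, and every other term is rescaled too, so the normalised vectors for the first two sentences are no longer the same.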

If you add more documents/sentences, the results change.

If you add the same document/sentence several times, the results change as well.
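
To make the effect concrete, here is a minimal pure-Python sketch that recomputes the similarity by hand. It assumes the smoothed idf that scikit-learn documents as its default, and it assumes the token lists below are what remains of each sentence after stop-word removal. The names tfidf_vectors and cosine are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised tf-idf vectors (as dicts) for tokenised docs.

    Assumes the smoothed idf: ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Assumed tokens after stop-word removal (an assumption of this sketch).
small = [["like", "apple"],
         ["apple", "day", "keeps", "doctor", "away"]]
large = small + [["compare", "apple", "orange"],
                 ["prefer", "scikit", "learn", "orange"],
                 ["scikit", "learn", "docs", "orange", "blue"]]
duplicated = small + [["like", "apple"]]  # first sentence added again

sim_small = cosine(*tfidf_vectors(small)[:2])
sim_large = cosine(*tfidf_vectors(large)[:2])
sim_dup = cosine(*tfidf_vectors(duplicated)[:2])
print(round(sim_small, 4), round(sim_large, 4), round(sim_dup, 4))
```

Under these assumptions the first two values come out at about 0.1943 and 0.1767, the two numbers from the question. The third value differs again, because repeating a sentence changes both N and the document frequencies.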
