Lay András

Reputation: 855

Compare two documents

There is a large dictionary, the vocabulary from which these documents are composed, and each word from it occurs at most once in any one document. I would like to compare the documents with each other and calculate a similarity value with a threshold: below it I would declare two documents very different, and above it very similar.

If a word is included in both documents, but in other documents rarely or not at all, it reinforces the similarity of the two documents, because it is a unique word that occurs only in these two.

If a word is included in both documents, but often also in other documents, this will weaken the similarity of the two documents, because it is a common word that will not make them similar.

Which method should I use? TF-IDF? Other?

Upvotes: 0

Views: 248

Answers (1)

gil.fernandes

Reputation: 14611

TF-IDF is a good start for sure.
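As a rough illustration of what that could look like, here is a minimal pure-Python sketch that weights each word by TF-IDF and compares two documents with cosine similarity. The function names (`tfidf_vectors`, `cosine`), the smoothed IDF variant, and the toy documents are my own assumptions, not something taken from a particular library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: weight} dict per document (docs are lists of words)."""
    num_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): rare words get a high weight,
        # a word present in every document gets a weight of 1.
        vectors.append({
            w: count * (math.log(num_docs / doc_freq[w]) + 1.0)
            for w, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus.
docs = [
    ["the", "zebra", "runs"],
    ["the", "zebra", "sleeps"],
    ["the", "cat", "runs"],
    ["the", "dog", "barks"],
]
vecs = tfidf_vectors(docs)
sim_rare = cosine(vecs[0], vecs[1])    # share "the" and the rarer word "zebra"
sim_common = cosine(vecs[0], vecs[3])  # share only the common word "the"
print(sim_rare, sim_common)
```

The pair that shares the rarer word scores higher than the pair that only shares a word found in every document, which is the behaviour asked for in the question. The cut-off between "very different" and "very similar" is not given by the method; it would have to be picked by looking at scores on your own data.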

You could improve it though by also considering the text length of the document. This is what the library Lucene does.

Lucene extended the TF-IDF formula to take the length of the document into account, because this matches human intuition better. After all, a single "cat" in a document of one word is more relevant than a single "cat" in a document of a thousand words.

As far as I can tell, the extended TF-IDF formula Lucene adopted is:

log(numDocs / (docFreq + 1)) * sqrt(tf) * (1/sqrt(length))

numDocs = total number of documents
docFreq = in how many documents the word was found
tf      = Term frequency in a specific document
length  = How many words there are in the document
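To make the effect of the length term concrete, here is a small sketch that evaluates the formula above exactly as written. `classic_score` is a hypothetical helper name and the numbers are made up; this is not Lucene's actual code.

```python
import math

def classic_score(tf, doc_freq, num_docs, length):
    """Score of one term in one document, following the formula above."""
    idf = math.log(num_docs / (doc_freq + 1))   # rarer term -> larger value
    length_norm = 1.0 / math.sqrt(length)       # longer document -> smaller value
    return idf * math.sqrt(tf) * length_norm

# The same single occurrence of a term, in a 1-word and in a 1000-word document
# (made-up corpus statistics: 1000 documents, term found in 9 of them).
short_doc = classic_score(tf=1, doc_freq=9, num_docs=1000, length=1)
long_doc = classic_score(tf=1, doc_freq=9, num_docs=1000, length=1000)
print(short_doc, long_doc)
```

The single "cat" in the one-word document scores far higher than the single "cat" in the thousand-word document, which is the intuition described above.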

Lucene has since moved on to another algorithm called BM25 ("Best Match 25"), which generally produces better results than TF-IDF. As far as I can tell, the BM25 formula used in Lucene is:

IDF * ((k + 1) * tf) / (k * (1.0 - b + b * (|d|/avgDl)) + tf)

k = constant (typically 1.2)
tf = term frequency
b = constant that tunes the influence of the document length (typically 0.75)
|d| = document length
avgDl = average document length
IDF = log(numDocs / (docFreq + 1)) + 1
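A minimal sketch of the BM25 term score, following the formula and symbols above. The helper name `bm25_term_score` is hypothetical, b = 0.75 is assumed as the default, and the IDF is read as log(numDocs / (docFreq + 1)) + 1, which is my assumption about the intended parenthesization. Lucene's own implementation may differ in the details.

```python
import math

def bm25_term_score(tf, doc_freq, num_docs, doc_len, avg_doc_len, k=1.2, b=0.75):
    """BM25 score of one term in one document, following the formula above."""
    # Assumed reading of the IDF definition.
    idf = math.log(num_docs / (doc_freq + 1)) + 1.0
    # Length normalisation: documents longer than average are penalised.
    norm = k * (1.0 - b + b * (doc_len / avg_doc_len))
    return idf * ((k + 1) * tf) / (norm + tf)

# Made-up corpus statistics: 1000 documents, term found in 9 of them,
# average document length 100 words.
stats = dict(doc_freq=9, num_docs=1000, avg_doc_len=100)
once = bm25_term_score(tf=1, doc_len=100, **stats)
ten_times = bm25_term_score(tf=10, doc_len=100, **stats)
hundred_times = bm25_term_score(tf=100, doc_len=100, **stats)
in_long_doc = bm25_term_score(tf=1, doc_len=1000, **stats)
print(once, ten_times, hundred_times, in_long_doc)
```

Unlike the sqrt(tf) factor in the TF-IDF variant, the term-frequency part here saturates: repeating a word raises the score less and less, and the score can never exceed IDF * (k + 1).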

For more details on the Lucene implementation check this great blog article.

Upvotes: 1
