George B

Reputation: 2712

Inverse Document Frequency Formula

I'm having trouble manually calculating tf-idf values. Python's scikit-learn keeps returning different values than I'd expect.

I keep reading that

idf(term) = log(# of docs / # of docs with term)

If so, won't you get a divide by zero error if there are no docs with the term?

To solve that problem, I read that you do

log(# of docs / (# of docs with term + 1))

But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me.

What am I not getting?

Upvotes: 3

Views: 1609

Answers (1)

Nikita Astrakhantsev

Reputation: 4749

The trick you describe is called Laplace smoothing (also additive or add-one smoothing), and it is supposed to add the same summand to the other part of the fraction as well: the numerator in your case, or the denominator in the original case.

In other words, you should add 1 to the total number of docs:

log((# of docs + 1) / (# of docs with term + 1))
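
A minimal Python sketch of that add-one formula, to check the two edge cases from the question. The function name `smoothed_idf` and its parameters are made up for illustration, and the natural log is an assumption (a different base only rescales the values):

```python
import math

def smoothed_idf(n_docs, n_docs_with_term):
    """Add-one smoothed idf: log((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (n_docs_with_term + 1))

n = 4  # hypothetical corpus size
print(smoothed_idf(n, 0))  # term in no document: finite, no division by zero
print(smoothed_idf(n, n))  # term in every document: 0.0, never negative
```

Since the number of docs with the term can never exceed the number of docs, the fraction is always at least 1, so the log is never negative.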

Btw, it is often better to use a smaller summand, especially for a small corpus:

log((# of docs + a) / (# of docs with term + a)),

where a = 0.001 or something like that.
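
A rough sketch of the general version, assuming natural log; `additive_idf` and the toy numbers are hypothetical, only there to show how the choice of a shifts the result on a tiny corpus:

```python
import math

def additive_idf(n_docs, n_docs_with_term, a=0.001):
    """Additively smoothed idf: log((N + a) / (df + a))."""
    return math.log((n_docs + a) / (n_docs_with_term + a))

n, df = 3, 1                      # toy corpus: 3 docs, term appears in 1
unsmoothed = math.log(n / df)     # plain log(N / df), for comparison
for a in (1, 0.001):
    value = additive_idf(n, df, a=a)
    # Report how far smoothing moved the value from the unsmoothed idf
    print(a, value, abs(value - unsmoothed))
```

With a = 1 the value drops to log(4 / 2) = log(2), well below the unsmoothed log(3), while a = 0.001 stays very close to it. As for the mismatch with scikit-learn: as far as I know, its TfidfTransformer with the default smooth_idf=True computes ln((1 + n) / (1 + df)) + 1 and then L2-normalizes each row by default, so hand-computed values will differ unless you reproduce both steps.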

Upvotes: 3
