Reputation: 2712
I'm having trouble manually calculating tf-idf values. scikit-learn (Python) keeps producing different values than I'd expect.
I keep reading that
idf(term) = log(# of docs / # of docs with term)
If so, won't you get a divide by zero error if there are no docs with the term?
To solve that problem, I read that you do
log(# of docs / (# of docs with term + 1))
But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me.
What am I not getting?
Upvotes: 3
Views: 1609
Reputation: 4749
The trick you describe is called Laplace smoothing (also known as additive or add-one smoothing). It is supposed to add the same summand to the other part of the fraction as well: the numerator in your case, or the denominator in the original case.
In other words, you should also add 1 to the total number of docs:
log((# of docs + 1) / (# of docs with term + 1))
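To make the difference concrete, here is a minimal sketch comparing the denominator-only variant from the question with add-one smoothing on both sides. The function names and the corpus size of 4 are made up for illustration, and the natural log is an assumption, since the formulas above don't fix a base.

```python
import math

def idf_denominator_only(n_docs, df):
    # Variant from the question: +1 in the denominator only.
    return math.log(n_docs / (df + 1))

def idf_add_one(n_docs, df):
    # Add-one smoothing on both the numerator and the denominator.
    return math.log((n_docs + 1) / (df + 1))

n_docs = 4  # hypothetical corpus size
for df in range(n_docs + 1):
    # df = number of docs containing the term
    print(df, round(idf_denominator_only(n_docs, df), 3), round(idf_add_one(n_docs, df), 3))
```

With both sides smoothed, a term that appears in every document gets idf = log(1) = 0 instead of a negative value, and a term that appears in no document still gets a finite value. As far as I know, scikit-learn's TfidfTransformer with its default smooth_idf=True uses ln((1 + n) / (1 + df)) + 1 and then L2-normalizes each row, which would be one reason hand-computed values differ from its output.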
Btw, it is often better to use a smaller summand, especially for a small corpus:
log((# of docs + a) / (# of docs with term + a))
where a = 0.001 or something like that.
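A rough sketch of the same formula with the summand as a parameter, in case it helps to experiment. The name smoothed_idf, the default a = 1.0, and the sample values are all assumptions for illustration, not anything taken from a library.

```python
import math

def smoothed_idf(n_docs, df, a=1.0):
    # Additive smoothing: the same summand `a` goes into both parts of the fraction.
    if a <= 0:
        raise ValueError("a must be positive, otherwise df = 0 divides by zero")
    return math.log((n_docs + a) / (df + a))

n_docs = 4  # hypothetical corpus size
for a in (1.0, 0.1, 0.001):
    unseen = smoothed_idf(n_docs, 0, a)        # term in no document
    everywhere = smoothed_idf(n_docs, n_docs, a)  # term in every document
    print(a, unseen, everywhere)
```

A term that is in every document gets 0 for any a. A smaller a keeps the values for terms that do occur closer to the unsmoothed log(n / df), but a term that occurs nowhere ends up with log((n + a) / a), which grows quickly as a shrinks, so it is worth checking that this is what you want.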
Upvotes: 3