Reputation: 949
I do not understand how CountVectorizer calculates the term frequency. I need to know this so that I can make a sensible choice for the max_df
parameter when filtering out terms from a corpus. Here is example code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, max_df=0.9)
X = vectorizer.fit_transform(['afr bdf dssd', 'afr bdf c', 'afr'])
word_freq_df = pd.DataFrame({'term': vectorizer.get_feature_names(),
                             'occurrences': np.asarray(X.sum(axis=0)).ravel().tolist()})
word_freq_df['frequency'] = word_freq_df['occurrences'] / np.sum(word_freq_df['occurrences'])
print word_freq_df.sort('occurrences', ascending=False).head()
   occurrences  term  frequency
0            3   afr   0.500000
1            2   bdf   0.333333
2            1  dssd   0.166667
It seems that 'afr' accounts for half of the terms in my corpus, as I expect from looking at the corpus. However, when I set max_df = 0.8
in CountVectorizer
, the term 'afr' is filtered out of my corpus. Playing around, I find that with the corpus in my example, CountVectorizer starts filtering 'afr' at a frequency of ~0.833. Could someone provide a formula for how the term frequency that enters max_df
is calculated?
Thanks
Upvotes: 3
Views: 6681
Reputation: 251388
The issue is apparently not with how the frequency is calculated, but with how the max_df
threshold is applied. The code for CountVectorizer
does this:
max_doc_count = (max_df
                 if isinstance(max_df, numbers.Integral)
                 else int(round(max_df * n_doc)))
That is, the maximum document count is obtained by rounding the document proportion times the number of documents. (Note that max_df
operates on document frequency, the fraction of documents containing a term, not on the share of total tokens your code computes: 'afr' occurs in 3 of 3 documents, a document proportion of 1.0, even though it is only half of all counted tokens.) In a 3-document corpus, the rounding means that any max_df
threshold equating to 2.5 documents or more counts the same as a threshold of 3 documents. The breakpoint of ~0.833 that you found is exactly 2.5/3 = 0.8333: a threshold at that fraction corresponds to 2.5 documents, which rounds up to 3, i.e. to all of them.
In short, "afr" is correctly considered to have a document frequency of 3, but the maximum document count is incorrectly computed as 3 (0.9 * 3 = 2.7, rounded up to 3), so "afr" is not filtered out.
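Here is a minimal pure-Python sketch of that rounding (the helper name max_doc_count is mine, mirroring the variable in the snippet above; note that Python 2, current at the time, rounded halves up, whereas Python 3 rounds them to even):

```python
import numbers

def max_doc_count(max_df, n_doc):
    # Mirrors the CountVectorizer logic quoted above: an integral max_df is
    # taken as an absolute document count; a float is taken as a proportion,
    # multiplied by the corpus size, and then rounded.
    return (max_df
            if isinstance(max_df, numbers.Integral)
            else int(round(max_df * n_doc)))

n_doc = 3  # the three-document example corpus

print(max_doc_count(0.9, n_doc))  # 0.9 * 3 = 2.7 -> 3: a term in all 3 documents survives
print(max_doc_count(0.8, n_doc))  # 0.8 * 3 = 2.4 -> 2: 'afr' (df = 3) exceeds 2 and is filtered
print(max_doc_count(2, n_doc))    # an integral max_df is used as-is
```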
I would consider this a bug in scikit. A maximum document frequency should round down, not up. If the threshold is 0.9, a term which occurs in all documents exceeds the threshold and should be excluded.
Upvotes: 6