nikosd

Reputation: 949

How is term frequency calculated in scikit-learn CountVectorizer

I do not understand how CountVectorizer calculates the term frequency. I need to know this so that I can make a sensible choice for the max_df parameter when filtering out terms from a corpus. Here is example code:

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(min_df = 1, max_df = 0.9)
    X = vectorizer.fit_transform(['afr bdf dssd','afr bdf c','afr'])
    word_freq_df = pd.DataFrame({'term': vectorizer.get_feature_names(), 'occurrences':np.asarray(X.sum(axis=0)).ravel().tolist()})
    word_freq_df['frequency'] = word_freq_df['occurrences']/np.sum(word_freq_df['occurrences'])
    print word_freq_df.sort('occurrences',ascending = False).head()

       occurrences  term  frequency
    0            3   afr   0.500000
    1            2   bdf   0.333333
    2            1  dssd   0.166667

It seems that 'afr' accounts for half of the term occurrences in my corpus, as I expect from looking at the corpus. However, when I set max_df = 0.8 in CountVectorizer, the term 'afr' is filtered out of my corpus. Playing around, I find that with the corpus in my example, CountVectorizer seems to assign a frequency of ~0.833 to 'afr'. Could someone provide a formula for how the term frequency that is compared against max_df is calculated?

Thanks

Upvotes: 3

Views: 6681

Answers (1)

BrenBarn

Reputation: 251388

The issue is apparently not with how the frequency is calculated, but with how the max_df threshold is applied. The code for CountVectorizer does this:

    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else int(round(max_df * n_doc)))

That is, the maximum document count is obtained by multiplying the document proportion by the number of documents and rounding the result to the nearest integer. This means that, in a 3-document corpus, any max_df threshold which equates to more than 2.5 documents actually counts the same as a threshold of 3 documents. The "frequency" of ~0.833 you are seeing is that cutoff, 2.5/3 = 0.8333, and not a frequency assigned to 'afr': a max_df just above 83.3% of 3 documents corresponds to just over 2.5 documents, which is rounded up to 3, so the threshold only excludes terms that occur in more than all of the documents, which is none of them.
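To make the rounding concrete, here is a minimal sketch of the threshold arithmetic for the 3-document example. It is not the library's code: `max_doc_count` and `kept_terms` are hypothetical helper names, the document counts are hard-coded from the example corpus, and I assume a term is kept when its document count does not exceed the threshold. The question's code is Python 2, where `round()` rounds halves away from zero; Python 3's `round()` rounds halves to even, so the sketch uses `floor(x + 0.5)` to mimic the old behaviour.

```python
import math
import numbers


def max_doc_count(max_df, n_doc):
    """Mimic the quoted snippet: integers pass through, proportions are rounded."""
    if isinstance(max_df, numbers.Integral):
        return max_df
    # floor(x + 0.5) rounds positive halves up, as Python 2's round() did.
    # Python 3's built-in round() would turn 2.5 into 2 instead.
    return int(math.floor(max_df * n_doc + 0.5))


# Number of documents each term occurs in, taken from the example corpus.
doc_freq = {"afr": 3, "bdf": 2, "dssd": 1}
n_doc = 3


def kept_terms(max_df):
    """Assumed filter: keep terms whose document count is at most the threshold."""
    limit = max_doc_count(max_df, n_doc)
    return sorted(term for term, df in doc_freq.items() if df <= limit)


for max_df in (0.9, 0.84, 0.8):
    print(max_df, max_doc_count(max_df, n_doc), kept_terms(max_df))
```

With these assumptions, 0.9 and 0.84 both give a threshold of 3 documents and keep 'afr', while 0.8 gives a threshold of 2 and drops it, which matches what the question describes.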

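In short, "afr" is correctly considered to have a document frequency of 3, but with max_df = 0.9 the maximum document frequency is incorrectly considered to be 3 as well (0.9*3 = 2.7, rounded up to 3). A term is only dropped when its document count exceeds that maximum, so "afr" is kept. With max_df = 0.8 the maximum becomes 2 (0.8*3 = 2.4, rounded down), and "afr" is removed.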

I would consider this a bug in scikit-learn. A maximum document frequency should round down, not up. If the threshold is 0.9, a term which occurs in all documents exceeds the threshold and should be excluded.
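For comparison, here is a rough sketch of what the round-down rule would do on the example corpus. It counts document frequency by hand in plain Python instead of calling scikit-learn. The tokenization (lowercasing, tokens of two or more word characters) is my assumption about the default settings, and `kept_with_floor` is a hypothetical helper, not anything the library provides.

```python
import math
import re
from collections import Counter

docs = ["afr bdf dssd", "afr bdf c", "afr"]

# Assumed tokenization: lowercase, tokens of two or more word characters.
# Under this assumption the single-letter "c" never becomes a term.
token_re = re.compile(r"\b\w\w+\b")

# Document frequency: the number of documents that contain each term at
# least once (not the total number of occurrences).
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(token_re.findall(doc.lower())))


def kept_with_floor(max_df):
    """Hypothetical round-down rule, not what the quoted code does."""
    limit = math.floor(max_df * len(docs))
    return sorted(term for term, df in doc_freq.items() if df <= limit)


print(dict(doc_freq))
print(kept_with_floor(0.9))
print(kept_with_floor(1.0))
```

Under this rule a max_df of 0.9 gives a limit of 2 documents, so 'afr', which occurs in all 3, is excluded. Only a max_df of 1.0 keeps it.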

Upvotes: 6
