CountVectorizer in sklearn with only words above some minimum number of occurrences

Question

I am using sklearn to train a logistic regression on some text data, using CountVectorizer to tokenize the data into unigrams and bigrams. I use a line of code like the one below:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1, 2), binary=True)

However, I'd like to limit myself to only including bigrams in my resultant sparse matrix that occur more than some threshold number of times (e.g., 50) across all of my data. Is there some way to specify this or make it happen?

araspion · Accepted Answer

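It looks like this can be solved by using CountVectorizer's min_df argument. Note that min_df is a document-frequency cutoff: an integer value keeps only the n-grams that appear in at least that many documents, which is not quite the same as total occurrences across the data, and it applies to unigrams as well as bigrams: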

vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=50)
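To see what min_df actually filters, here is a minimal sketch on a made-up four-document corpus, with a threshold of 2 standing in for the 50 from the question. The corpus and the variable names (docs, kept_by_df, frequent) are assumptions for illustration only. The second half shows one possible way to threshold on total occurrences instead of document frequency, by summing the count matrix columns and passing the surviving n-grams back in through the vocabulary argument. It is not the only way to do this.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "bird" occurs twice overall but in only one document.
docs = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "bird bird flew",
]

# min_df=2 keeps n-grams found in at least 2 documents (document frequency).
vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=2)
X = vect.fit_transform(docs)
kept_by_df = sorted(vect.vocabulary_)
print(kept_by_df)  # ['cat', 'sat', 'the', 'the cat']

# Alternative sketch: threshold on total occurrences across the whole corpus.
counter = CountVectorizer(ngram_range=(1, 2))  # raw counts, no filtering
counts = counter.fit_transform(docs)
totals = np.asarray(counts.sum(axis=0)).ravel()  # total count per n-gram

# Order the n-grams by column index so they line up with `totals`.
names = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
frequent = [name for name, total in zip(names, totals) if total >= 2]
print(frequent)  # ['bird', 'cat', 'sat', 'the', 'the cat']

# Refit with a fixed vocabulary to get the binary matrix for those n-grams.
final = CountVectorizer(ngram_range=(1, 2), binary=True, vocabulary=frequent)
X_total = final.fit_transform(docs)
```

In this toy data "bird" is dropped by min_df=2 because it appears in only one document, but it survives the total-count filter because it occurs twice. A float min_df (e.g. 0.01) is interpreted as a proportion of documents rather than an absolute count.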
