araspion

Reputation: 713

CountVectorizer in sklearn with only words above some minimum number of occurrences

I am using sklearn to train a logistic regression on some text data, using CountVectorizer to tokenize the data into unigrams and bigrams. I use a line of code like the one below:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1, 2), binary=True)

However, I'd like my resulting sparse matrix to include only the bigrams that occur more than some threshold number of times (e.g., 50) across all of my data. Is there a way to specify this?

Upvotes: 2

Views: 4082

Answers (2)

Dhruv Ghulati

Reputation: 3026

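You can also use CountVectorizer(ngram_range=(1, 2), binary=True, max_features=5000) to keep only the 5000 most frequent terms. Note that with ngram_range=(1, 2) these are unigrams and bigrams combined, and that max_features caps the vocabulary size rather than applying a minimum-count threshold.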
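To illustrate how max_features behaves, here is a minimal sketch on a made-up four-document corpus (the documents and the limit of 3 are assumptions chosen only for the demo, not taken from the question). It keeps the 3 highest-ranked terms regardless of how often they actually occur:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "apple banana",
    "apple banana",
    "apple cherry",
    "apple",
]

# Keep only the 3 most frequent terms (unigrams and bigrams together).
vect = CountVectorizer(ngram_range=(1, 2), binary=True, max_features=3)
X = vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(vect.vocabulary_)
print(kept_terms)  # ['apple', 'apple banana', 'banana']
```

Here "cherry" and "apple cherry" are dropped because they rank lowest, not because they fall under a fixed count, so this is a rank cutoff rather than the threshold asked about in the question.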

Upvotes: 1

araspion

Reputation: 713

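It looks like this can be solved with CountVectorizer's min_df argument. Given as an integer, it drops every term that appears in fewer than that many documents (a document-frequency threshold, not a total-occurrence count):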

vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=500)
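A small sketch of the effect, assuming a made-up four-document corpus and min_df=2 so the result is easy to check by hand (both are placeholders, not values from my real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat",
    "the cat ran ran ran",
    "the dog sat",
    "a bird flew",
]

# min_df=2: keep only terms that appear in at least 2 documents.
vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=2)
X = vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(vect.vocabulary_)
print(kept_terms)  # ['cat', 'sat', 'the', 'the cat']

# "ran" occurs 3 times in total but in only one document, so it is dropped.
print("ran" in vect.vocabulary_)  # False
```

Since min_df counts documents rather than total occurrences, a term repeated many times inside a single document still gets filtered out.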

Upvotes: 4
