Reputation: 713
I am using sklearn to train a logistic regression on some text data, using CountVectorizer to tokenize the data into unigrams and bigrams. I use a line of code like the one below:
vect = CountVectorizer(ngram_range=(1, 2), binary=True)
However, I'd like to limit myself to only including bigrams in my resultant sparse matrix that occur more than some threshold number of times (e.g., 50) across all of my data. Is there some way to specify this or make it happen?
Upvotes: 2
Views: 4082
Reputation: 3026
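You can also use CountVectorizer(ngram_range=(1, 2), binary=True, max_features=5000) to keep only the 5000 most frequent terms. Note that this caps the vocabulary size rather than applying a count threshold, and the limit covers unigrams and bigrams together.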
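As a small illustration of how max_features behaves, here is a sketch on a made-up five-document corpus. The documents and the limit of 3 are assumptions chosen for the demo, not values from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "good movie",
    "good movie",
    "good movie",
    "good book",
    "bad book",
]

# Keep only the 3 most frequent terms. Unigrams and bigrams
# compete for the same 3 slots.
vect = CountVectorizer(ngram_range=(1, 2), binary=True, max_features=3)
X = vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(vect.vocabulary_)
print(kept_terms)
print(X.shape)
```

Here "good" appears in 4 documents and "movie" and "good movie" in 3 each, so those three survive and everything rarer is dropped, whatever its actual count.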
Upvotes: 1
Reputation: 713
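It looks like this can be solved by using CountVectorizer's min_df argument, which drops any term that appears in fewer than the given number of documents. Note that it counts documents rather than total occurrences, and that it applies to unigrams as well as bigrams:
vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=500)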
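To see the effect of min_df on a scale that is easy to check by hand, here is a minimal sketch with a tiny invented corpus and min_df=2. Both are assumptions for the demo; a real dataset would use a much larger threshold such as the one above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "a bird flew",
]

# min_df=2: keep a term only if it appears in at least 2 documents.
# The default tokenizer ignores single-character tokens such as "a".
vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=2)
X = vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(vect.vocabulary_)
print(kept_terms)
print(X.shape)
```

Only "the", "cat", "sat" and the bigram "the cat" occur in two or more documents, so those are the four columns left in the matrix. Because min_df is an integer here, it is read as an absolute document count; a float between 0.0 and 1.0 would be read as a proportion of documents instead.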
Upvotes: 4