Reputation: 132
I am using CountVectorizer from sklearn. I want to provide a list of stop words and apply the vectorizer with ngram_range=(1, 3).
From what I can tell, if a word (say "me") is in the list of stop words, it is also dropped from the higher-order n-grams, i.e., "tell me" would not be a feature. Is there a way to specify something like "apply the stop words only when the n-gram is a unigram"?
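A small reproduction of the behaviour described above, as a sketch only: the sentence and the one-word stop list are made up for illustration. It suggests the stop word is removed before the n-grams are assembled, so the neighbouring tokens get joined across the gap.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentence and stop list, for illustration only.
docs = ["please tell me more"]
vec = CountVectorizer(stop_words=["me"], ngram_range=(1, 3))
vec.fit(docs)

# vocabulary_ maps each learned feature to its column index.
features = sorted(vec.vocabulary_)
print(features)

# "tell me" is missing, while "tell more" appears even though the two
# words were not adjacent in the original text.
print("tell me" in vec.vocabulary_, "tell more" in vec.vocabulary_)
```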
Upvotes: 6
Views: 1106
Reputation: 4749
You have at least 2 options:
1. Combine two kinds of features with FeatureUnion: one CountVectorizer with ngram_range=(1, 1) that applies the stop word list, and one with ngram_range=(2, 3) that does no stop word filtering.
2. (More efficient, but harder to implement and use.) Implement your own analyzer that checks the stop word list only for unigrams; see, for example, the code sample in this answer.
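Both options could look roughly like the sketch below. This is not the code from the linked answer; the documents, the stop list, and the helper name `unigram_stop_analyzer` are hypothetical, and the custom analyzer simply reuses CountVectorizer's default preprocessor and tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical data and stop list, for illustration only.
stop_words = ["me"]
docs = ["please tell me more", "tell me why"]

# Option 1: two vectorizers side by side.
# Unigrams are filtered by the stop list; bigrams and trigrams are not.
union = FeatureUnion([
    ("unigrams", CountVectorizer(ngram_range=(1, 1), stop_words=stop_words)),
    ("ngrams", CountVectorizer(ngram_range=(2, 3))),
])
X_union = union.fit_transform(docs)
fitted = dict(union.transformer_list)
union_features = set(fitted["unigrams"].vocabulary_) | set(fitted["ngrams"].vocabulary_)

# Option 2: a single vectorizer with a custom analyzer.
# Reuse the default lowercasing and tokenization so both options agree.
_base = CountVectorizer()
_preprocess = _base.build_preprocessor()
_tokenize = _base.build_tokenizer()
_stop_set = set(stop_words)


def unigram_stop_analyzer(doc):
    """Return unigrams minus stop words, plus all 2- and 3-grams."""
    tokens = _tokenize(_preprocess(doc))
    # Stop words are checked for unigrams only.
    features = [tok for tok in tokens if tok not in _stop_set]
    # Higher-order n-grams are built from the unfiltered token list.
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


custom = CountVectorizer(analyzer=unigram_stop_analyzer)
X_custom = custom.fit_transform(docs)

print(sorted(custom.vocabulary_))
```

With the custom analyzer there is a single vocabulary and one pass over each document, whereas the FeatureUnion version tokenizes every document twice and keeps two separate vocabularies. Note that ngram_range and stop_words are not applied by CountVectorizer when analyzer is a callable, so the function has to handle both itself.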
Upvotes: 3