Only ignore stop words for ngram_range=1

Question

I am using CountVectorizer from sklearn. I want to provide a list of stop words and apply the count vectorizer with an ngram_range of (1,3).

From what I can tell, if a word, say "me", is in the list of stop words, then it is also dropped from the higher-order n-grams; for example, "tell me" would not be a feature. Is there a way to specify something like "remove stop words only when the n-gram length is 1"?

Nikita Astrakhantsev · Accepted Answer

You have at least 2 options:

1. Combine two kinds of features with FeatureUnion: one CountVectorizer with ngram_range of (1,1) that uses the stop word list, and one with ngram_range of (2,3) that uses no stop words.
2. (More efficient, but harder to implement and use.) Implement your own analyzer that checks for presence in the stop word list only for unigrams; see, for example, the code sample in this answer.
