Natalie Arellano

Reputation: 132

Only ignore stop words for ngram_range=1

I am using CountVectorizer from sklearn. I want to provide a list of stop words and apply the vectorizer with ngram_range=(1, 3).

From what I can tell, if a word (say, "me") is in the list of stop words, it is removed before the n-grams are built, so it never shows up in higher-order n-grams; for example, "tell me" would not be a feature. Is there a way to specify something like "apply stop words only when the n-gram size is 1"?

Upvotes: 6

Views: 1106

Answers (1)

Nikita Astrakhantsev

Reputation: 4749

You have at least 2 options:

  1. Combine two kinds of features with FeatureUnion: one CountVectorizer with ngram_range=(1, 1) that filters stop words, and one with ngram_range=(2, 3) that does not filter stop words.

  2. (More efficient, but harder to implement and use.) Implement your own analyzer that checks the stop-word list only for unigrams; see, for example, the code sample in this answer.

Upvotes: 3
