Alex

Reputation: 4264

CountVectorizer ignoring 'I'

Why is CountVectorizer in sklearn ignoring the pronoun "I"?

from sklearn.feature_extraction.text import CountVectorizer
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
<1x3 sparse matrix of type '<class 'numpy.int64'>'
ngram_vectorizer.get_feature_names()
['gave it', 'he gave', 'it to']

Upvotes: 10

Views: 2034

Answers (1)

ldirer

Reputation: 6756

The default tokenizer only keeps tokens of two or more word characters, so a one-letter word such as "I" is dropped before the bigrams are built.

You can change this behaviour by passing an appropriate token_pattern to your CountVectorizer.

The default pattern is (see the signature in the docs):

'token_pattern': u'(?u)\\b\\w\\w+\\b'
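To see what that pattern does on its own, here is a minimal sketch that applies it with Python's standard re module, outside scikit-learn. The variable names are illustrative, and the input is lowercased by hand to mimic the vectorizer's default lowercasing.

```python
import re

# Default pattern quoted above: a word boundary, then at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters, so single letters survive.
RELAXED_PATTERN = r"(?u)\b\w+\b"

# Lowercase by hand (assumption: mirrors the vectorizer's default lowercasing).
text = "HE GAVE IT TO I".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

With the default pattern "i" never becomes a token, so no bigram can contain it.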

You can get a CountVectorizer that does not drop one-letter words by changing the default, for instance:

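from sklearn.feature_extraction.text import CountVectorizer
# \w+ instead of \w\w+ keeps one-character tokens such as "i"
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=r"(?u)\b\w+\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call get_feature_names_out() instead
print(ngram_vectorizer.get_feature_names())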

Which gives:

['gave it', 'he gave', 'it to', 'to i']
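As a rough illustration of where 'to i' comes from, the sketch below imitates the tokenize-then-pair step in plain Python. It is not scikit-learn's implementation; word_bigrams is a hypothetical helper, and sorting the result is only an assumption made to match the order of the feature names shown above.

```python
import re

def word_bigrams(text, token_pattern):
    """Hypothetical helper: lowercase, tokenize with the regex, pair adjacent tokens."""
    tokens = re.findall(token_pattern, text.lower())
    # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens.
    pairs = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return sorted(pairs)

default_bigrams = word_bigrams("HE GAVE IT TO I", r"(?u)\b\w\w+\b")
relaxed_bigrams = word_bigrams("HE GAVE IT TO I", r"(?u)\b\w+\b")

print(default_bigrams)  # ['gave it', 'he gave', 'it to']
print(relaxed_bigrams)  # ['gave it', 'he gave', 'it to', 'to i']
```

The only difference between the two runs is the token pattern, which matches the behaviour seen in the question and in this answer.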

Upvotes: 12
