CountVectorizer() not working with single letter word

Question

Suppose I have to apply CountVectorizer() to the following data:

words = [
     'A am is',
     'This the a',
     'the am is',
     'this a am',
]

I did the following:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(words)
print(X.toarray())

It returns the following:

[[1 1 0 0]
 [0 0 1 1]
 [1 1 1 0]
 [1 0 0 1]]

For reference, print(vectorizer.get_feature_names()) prints ['am', 'is', 'the', 'this']

Why is 'a' not being picked up?
Is it that single-letter words don't count as words in CountVectorizer()?

mujjiga · Accepted Answer

Check the documentation for this parameter:

token_pattern

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

All single-character tokens are ignored by the default tokenizer. That is why 'a' is missing.
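To see the effect outside of scikit-learn, the sketch below applies the default token_pattern, r"(?u)\b\w\w+\b", directly with Python's re module. It only illustrates the regular expression; the variable names are made up for this example, and the lowercasing is done by hand to mimic CountVectorizer's default lowercase=True.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")

# Two of the documents from the question, lowercased by hand here
# (CountVectorizer lowercases by default before tokenizing).
doc_one = "A am is".lower()
doc_two = "This the a".lower()

tokens_one = default_pattern.findall(doc_one)
tokens_two = default_pattern.findall(doc_two)

print(tokens_one)  # ['am', 'is']
print(tokens_two)  # ['this', 'the']
```

In both documents the lone 'a' never matches, because the pattern requires at least two consecutive word characters.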

If you want single-character tokens to be in the vocabulary, you can use a custom tokenizer.

Sample Code

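from sklearn.feature_extraction.text import CountVectorizer

# Split on whitespace so that single-character tokens such as 'a' are kept
vectorizer = CountVectorizer(tokenizer=lambda txt: txt.split())
X = vectorizer.fit_transform(words)  # words is the list from the question
print(vectorizer.get_feature_names())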

Output:

['a', 'am', 'is', 'the', 'this']
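An alternative to a custom tokenizer is to relax token_pattern itself so that it accepts one or more word characters. The following is a minimal sketch under that assumption, reusing the data from the question. It reads the vocabulary from the fitted vocabulary_ attribute rather than get_feature_names(), since that method is not available in newer scikit-learn releases.

```python
from sklearn.feature_extraction.text import CountVectorizer

words = [
    'A am is',
    'This the a',
    'the am is',
    'this a am',
]

# Same as the default pattern, but \w+ instead of \w\w+,
# so tokens of length one are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(words)

# vocabulary_ maps each term to its column index; columns are in sorted term order.
features = sorted(vectorizer.vocabulary_)
print(features)
print(X.toarray())
```

Unlike txt.split(), this pattern still treats punctuation as a separator, so a document such as 'a, am' would yield 'a' and 'am' rather than 'a,' and 'am'.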
