Reputation: 1037
I need to apply CountVectorizer() to the following data:
words = [
'A am is',
'This the a',
'the am is',
'this a am',
]
I did the following:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(words)
print(X.toarray())
It returns the following:
[[1 1 0 0]
[0 0 1 1]
[1 1 1 0]
[1 0 0 1]]
For reference print(vectorizer.get_feature_names())
prints ['am', 'is', 'the', 'this']
Why is 'a' not being read?
Is it that single-letter words don't count as words in CountVectorizer()?
Upvotes: 2
Views: 1556
Reputation: 16866
Check the documentation:
token_pattern
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
All single-character tokens are ignored by the default tokenizer, which is why 'a' is missing.
If you want single-character tokens in the vocabulary, you have to use a custom tokenizer:
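To see the effect of the default pattern in isolation, here is a minimal sketch that uses only Python's re module. The default value of token_pattern is r"(?u)\b\w\w+\b"; the name relaxed_pattern and its value are illustrative assumptions, not something scikit-learn defines.

```python
import re

# Default token_pattern used by CountVectorizer: two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Illustrative (assumed) variant that also accepts one-character tokens.
relaxed_pattern = r"(?u)\b\w+\b"

text = "this a am"

default_tokens = re.findall(default_pattern, text)
relaxed_tokens = re.findall(relaxed_pattern, text)

print(default_tokens)  # ['this', 'am']  -> 'a' is dropped
print(relaxed_tokens)  # ['this', 'a', 'am']
```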
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=lambda txt: txt.split())
X = vectorizer.fit_transform(words)
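# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
print(vectorizer.get_feature_names())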
Output:
['a', 'am', 'is', 'the', 'this']
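As an alternative to a custom tokenizer, it should also work to relax token_pattern itself so that one-character tokens match. The sketch below assumes the pattern r"(?u)\b\w+\b" is acceptable for your data; unlike a plain split(), it still treats punctuation as a separator. It reads the vocabulary from the vocabulary_ attribute so that it does not depend on which get_feature_names variant your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

words = ['A am is', 'This the a', 'the am is', 'this a am']

# Assumption: widen the default pattern so one-character tokens are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(words)

# vocabulary_ maps each token to its column index; indices follow
# sorted token order, so sorting the keys gives the column order.
features = sorted(vectorizer.vocabulary_)
print(features)
print(X.toarray())
```

Input is still lowercased by default, so 'A' and 'a' are counted as the same token.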
Upvotes: 4