Reputation: 39
When creating an instance of the CountVectorizer class, I'm not sure what the difference is between
vectorizer = CountVectorizer(tokenizer=word_tokenize)
and vectorizer = CountVectorizer()
Please help me make it clear. Thank you for your time.
Upvotes: 1
Views: 293
Reputation: 36724
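By default, CountVectorizer already tokenizes the input: it lowercases the text and extracts tokens with the regular expression token_pattern=r"(?u)\b\w\w+\b", which keeps runs of two or more word characters and treats punctuation as a separator. Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. In other words, it turns a long string like 'This is the input' into a sequence: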
['This', 'is', 'the', 'input']
If you specify the tokenizer argument with a callable in CountVectorizer, it will use this function to tokenize the input instead of the default regular expression, and token_pattern is then ignored (source).
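To see the difference concretely, here is a minimal sketch that compares the two analyzers on one sentence. It assumes scikit-learn is installed. simple_tokenize is a hypothetical stand-in for nltk's word_tokenize (used here so the example does not need NLTK data downloads), so its output will not match word_tokenize exactly.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: keeps single-character
    # words and emits each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


text = "This is a test, OK?"

# Default behaviour: lowercase, then match token_pattern r"(?u)\b\w\w+\b".
default_analyzer = CountVectorizer().build_analyzer()

# Custom behaviour: lowercase, then call the supplied tokenizer.
# token_pattern=None silences the warning about the now-unused pattern.
custom_analyzer = CountVectorizer(
    tokenizer=simple_tokenize, token_pattern=None
).build_analyzer()

default_tokens = default_analyzer(text)
custom_tokens = custom_analyzer(text)

print(default_tokens)  # ['this', 'is', 'test', 'ok']
print(custom_tokens)   # ['this', 'is', 'a', 'test', ',', 'ok', '?']
```

The default drops the one-letter word "a" and all punctuation, while the custom tokenizer keeps them, so the two vectorizers would end up with different vocabularies.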
Upvotes: 1