Tran Tran

Reputation: 39

CountVectorizer in Scikit Learn

When creating an instance of the CountVectorizer class, what is the difference between vectorizer = CountVectorizer(tokenizer=word_tokenize) and vectorizer = CountVectorizer()?

Please help me make it clear. Thank you for your time.

Upvotes: 1

Views: 293

Answers (1)

Nicolas Gervais

Reputation: 36724

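Tokenization is the process of splitting a string of input characters into a sequence of smaller units (tokens). In other words, it turns a long string like 'This is the input' into a sequence:

['This', 'is', 'the', 'input']

CountVectorizer tokenizes the input even when you don't pass a tokenizer: by default it lowercases the text and extracts tokens with the regular expression token_pattern=r"(?u)\b\w\w+\b", which keeps runs of two or more word characters and treats punctuation as a separator. If you pass a callable as the tokenizer argument, CountVectorizer uses that function instead of the default regular expression, while preprocessing (such as lowercasing) and n-gram generation still apply (source).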
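To make the difference concrete, here is a minimal sketch comparing the tokens produced with and without a tokenizer argument. The sample sentence and the whitespace_tokenize helper are assumptions for illustration only; the helper is a hypothetical stand-in for NLTK's word_tokenize (which needs extra data downloaded), so the exact tokens you get from word_tokenize will differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "This is a test, isn't it?"  # illustrative sample, not from the question

# No tokenizer argument: lowercasing plus the default token_pattern
# (runs of 2+ word characters; punctuation acts as a separator).
default_analyzer = CountVectorizer().build_analyzer()
default_tokens = default_analyzer(text)


def whitespace_tokenize(doc):
    """Hypothetical stand-in for word_tokenize: split on whitespace only."""
    return doc.split()


# Custom tokenizer: replaces the regex step; lowercasing still runs first.
# token_pattern=None signals that the default pattern is intentionally unused.
custom_analyzer = CountVectorizer(
    tokenizer=whitespace_tokenize, token_pattern=None
).build_analyzer()
custom_tokens = custom_analyzer(text)

print(default_tokens)  # ['this', 'is', 'test', 'isn', 'it']
print(custom_tokens)   # ['this', 'is', 'a', 'test,', "isn't", 'it?']
```

With the default settings, the single-character tokens "a" and "t" are dropped and punctuation disappears; with the custom callable, whatever the function returns becomes the vocabulary, so punctuation stays attached here.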

Upvotes: 1
