Reputation: 396
I was trying to better understand how the CountVectorizer class works.
I'm quite confused about the differences between the preprocessor, tokenizer and analyzer parameters.
The documentation states that all of these parameters can be callables, so my guess is that you can supply your own functions to customize the various steps.
That said, I'm not sure why they are mutually exclusive (i.e. preprocessor can be callable if and only if analyzer is None, similarly tokenizer can be a callable if and only if analyzer='word' - from the doc).
I'd much appreciate it if someone could explain how these parameters differ in use and what each corresponding step is supposed to accomplish.
Thanks in advance, let me know if the question is not problem specific enough for stackoverflow!
Upvotes: 3
Views: 2498
Reputation: 2478
There is an explanation provided in the documentation.
preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.
tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
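To make the relationship between the three steps concrete, here is a rough pure-Python sketch of how a default word-level analyzer chains them together. This is a simplification for illustration, not scikit-learn's actual code; the function names (`preprocess`, `tokenize`, `word_ngrams`, `default_word_analyzer`) are made up for this example.

```python
import re

def preprocess(doc):
    # Whole document in, whole document out (here: just lowercase it).
    return doc.lower()

def tokenize(doc):
    # Split the preprocessed string into tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc)

def word_ngrams(tokens, ngram_range=(1, 1), stop_words=None):
    # Stop word filtering and n-gram extraction happen at the analyzer level.
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def default_word_analyzer(doc, ngram_range=(1, 1), stop_words=None):
    # The analyzer is the outer step: it calls the preprocessor, then the
    # tokenizer, then filters stop words and builds n-grams.
    return word_ngrams(tokenize(preprocess(doc)), ngram_range, stop_words)

result = default_word_analyzer(
    "The cat sat on the mat", ngram_range=(1, 2), stop_words={"the", "on"}
)
print(result)
```

A custom analyzer replaces `default_word_analyzer` as a whole, which is why it would have to redo the stop word and n-gram handling itself if it needs them.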
So preprocessor and tokenizer work together. A custom preprocessor only takes effect when analyzer is not a callable (the default is analyzer='word', not None), because the built-in analyzers are the ones that call the preprocessor. If you pass your own callable as analyzer, the preprocessor is not used anymore. And I am assuming the tokenizer is only called (and therefore only worth customizing) when the analyzer operates at the "word" level.
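You can check this yourself with `build_analyzer()`, which returns the callable the vectorizer applies to each document. The snippet below is a small sketch assuming scikit-learn is installed; `strip_tags` and `split_on_whitespace` are hypothetical helpers written for this example, not part of the library.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def strip_tags(doc):
    # Hypothetical preprocessor: remove HTML tags, then lowercase.
    # (A custom preprocessor replaces the built-in lowercasing step.)
    return re.sub(r"<[^>]+>", " ", doc).lower()

def split_on_whitespace(doc):
    # Hypothetical tokenizer: receives the preprocessor's output.
    return doc.split()

doc = "<p>Hello World</p> hello"

# Built-in 'word' analyzer: calls our preprocessor, then our tokenizer.
word_vec = CountVectorizer(preprocessor=strip_tags, tokenizer=split_on_whitespace)
word_tokens = word_vec.build_analyzer()(doc)
print(word_tokens)

# Callable analyzer: it gets the raw document, so the preprocessor is skipped.
custom_vec = CountVectorizer(analyzer=split_on_whitespace, preprocessor=strip_tags)
custom_tokens = custom_vec.build_analyzer()(doc)
print(custom_tokens)
```

In the second case the HTML tags are still in the tokens, since the custom analyzer never calls `strip_tags`.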
Upvotes: 4