Reputation: 680
I want to train a Naive Bayes classifier on some documents, and the code below runs fine when the documents are strings. The issue is that my strings go through a series of pre-processing steps that go beyond stopword removal, lemmatization, etc.: some custom conversions return a list of n-grams, where n can be 1, 2, or 3 depending on the context of the text. Since I now have a list of n-grams instead of a string representing each document, I am not sure how to pass it as input to CountVectorizer. Any suggestions?
Code that works fine when docs is an array of documents of type string:
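from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# binary must be the boolean True, not the string 'true'
count_vectorizer = CountVectorizer(binary=True)
data = count_vectorizer.fit_transform(docs)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data, op)  # op holds the class labels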
Upvotes: 1
Views: 1360
Reputation: 4749
You should combine all your pre-processing steps into a preprocessor function and possibly a tokenizer function, then pass them to CountVectorizer; see section 4.2.3.10 and the CountVectorizer description in the scikit-learn docs. For examples of such tokenizers/transformers, see the related question or the source code of scikit-learn itself.
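If the pre-processing has already been run and each document is a list of n-grams, one way to apply this idea is to hand CountVectorizer a callable analyzer that returns the list unchanged. This is only a minimal sketch: the `identity` helper, the sample `docs`, and the `labels` are made up for illustration, and the TF step from the question is left out because BernoulliNB binarizes its input by default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB


def identity(tokens):
    # Hypothetical helper: each document is already a list of n-grams,
    # so the analyzer returns it as-is.
    return tokens


# Hypothetical pre-processed documents: one list of n-grams per document.
docs = [
    ["cheap", "pills", "cheap pills", "buy now"],
    ["meeting", "tomorrow", "project meeting"],
    ["buy now", "cheap", "limited offer"],
    ["project", "status", "meeting tomorrow"],
]
labels = ["spam", "ham", "spam", "ham"]  # hypothetical class labels

# A callable analyzer replaces the built-in preprocessing, tokenization
# and n-gram extraction, so every list item becomes one feature.
vectorizer = CountVectorizer(analyzer=identity, binary=True)
data = vectorizer.fit_transform(docs)

classifier = BernoulliNB().fit(data, labels)

# New documents must go through the same custom pre-processing first.
prediction = classifier.predict(vectorizer.transform([["cheap", "buy now"]]))
```

Alternatively, if you would rather keep raw strings as input, the custom conversion itself could be passed as the `analyzer` (or split across `preprocessor` and `tokenizer`), so that the vectorizer runs it on every document during both `fit_transform` and `transform`.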
Upvotes: 1