Python: Creating Term Document Matrix from list

Question

So I wanted to train a Naive Bayes classifier on some documents, and the code below runs fine when the documents are strings. The issue is that my strings go through a series of pre-processing steps that involve more than stopword removal, lemmatization, etc.: there are some custom conversions that return a list of n-grams, where n can be 1, 2, or 3 depending on the context of the text. Since I now have a list of n-grams instead of a string representing each document, I am confused about how to pass it as input to CountVectorizer. Any suggestions?

Code that works fine when docs is an array of document strings:

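from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# docs: list of document strings; op: the corresponding class labels
count_vectorizer = CountVectorizer(binary=True)
data = count_vectorizer.fit_transform(docs)

tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data, op)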

Nikita Astrakhantsev · Accepted Answer

You should combine all your pre-processing steps into preprocessor and, if needed, tokenizer functions; see section 4.2.3.10 and the CountVectorizer description in the scikit-learn docs. For examples of such tokenizers/transformers, see the related question or the source code of scikit-learn itself.
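
If the pre-processing has already been run and each document is a list of n-grams, one closely related option is to pass a callable as the analyzer parameter of CountVectorizer, so that the vectorizer uses the list as-is instead of tokenizing a string. This is a variation on the preprocessor/tokenizer suggestion above, not the exact method it describes. The sketch below is minimal and rests on assumptions: the sample documents, the labels, and the identity helper are all made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical pre-processed documents: each one is already a list of n-grams.
docs = [
    ["cheap", "pills", "cheap pills", "buy now"],
    ["meeting", "tomorrow", "project meeting"],
    ["buy now", "cheap", "limited offer"],
    ["project", "deadline", "meeting"],
]
# Hypothetical class labels, one per document.
labels = ["spam", "ham", "spam", "ham"]


def identity(tokens):
    """Return the n-gram list unchanged so each n-gram becomes one feature."""
    return tokens


# A callable analyzer receives each raw document and returns its features,
# which bypasses the built-in preprocessing, tokenizing and n-gram steps.
count_vectorizer = CountVectorizer(analyzer=identity, binary=True)
data = count_vectorizer.fit_transform(docs)

classifier = BernoulliNB().fit(data, labels)

# New documents must go through the same custom pre-processing first.
new_doc = [["cheap pills", "buy now"]]
prediction = classifier.predict(count_vectorizer.transform(new_doc))
```

A named function is used rather than a lambda so that the fitted vectorizer can still be pickled. If you would rather have the vectorizer run the custom conversion itself, the same idea applies with the analyzer taking a raw string and returning the n-gram list.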
