Scikit-learn tfidf vectorizer in minibatches?

Question

I've been trying to compute tf-idf weights for a large corpus that does not fit in memory.

Can I read the documents iteratively and call

vectorizer.fit()

in each iteration? Does each call take into account only the current batch, or does it remember the previous ones?

Thanks!

benbo · Accepted Answer

The right solution depends on your particular application. You could consider gensim's tf-idf implementation, which can process the corpus as a stream and does not need to keep the entire corpus in memory, as this post explains.
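To make the streaming idea concrete, here is a minimal pure-Python sketch. It is not gensim's or scikit-learn's code. As far as I know, calling fit() again on scikit-learn's TfidfVectorizer relearns everything from the documents passed in that call, so earlier batches are not remembered. The sketch instead keeps only running document-frequency counts, which can be updated one batch at a time. The names StreamingTfidf, partial_fit and tokenize are hypothetical, the whitespace tokenizer is an assumption, and the idf formula is the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


class StreamingTfidf:
    """Illustrative only: accumulates document frequencies batch by batch."""

    def __init__(self):
        self.n_docs = 0            # number of documents seen so far
        self.doc_freq = Counter()  # term -> number of documents containing it

    def partial_fit(self, batch):
        # Update the running counts; the batch can be discarded afterwards.
        for text in batch:
            self.n_docs += 1
            self.doc_freq.update(set(tokenize(text)))
        return self

    def idf(self, term):
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0

    def transform(self, text):
        # Weight raw term counts by idf, then L2-normalise the vector.
        counts = Counter(tokenize(text))
        weights = {t: c * self.idf(t) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}


# Two mini-batches stand in for chunks read from disk.
batches = [["the cat sat", "the dog barked"], ["the cat purred"]]
model = StreamingTfidf()
for batch in batches:
    model.partial_fit(batch)

vec = model.transform("the cat sat")
```

Because the idf values depend on counts from the whole corpus, the weights are only final after every batch has been seen, so this takes one pass to accumulate counts and a second pass to transform the documents.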
