Benben

Reputation: 1455

Topic Modeling tool for large data set (30GB)

I'm looking for a topic modeling tool that can be applied to a large data set.

My current training data set is 30 GB. I tried MALLET topic modeling, but I always get an OutOfMemoryError.

If you have any tips, please let me know.

Upvotes: 5

Views: 3795

Answers (3)

Zach

Reputation: 7940

The GraphLab Create topic model toolkit (with Python API bindings) should be able to handle a dataset that large.
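
For reference, a rough sketch of what that could look like with GraphLab Create's Python API (the file name, column name, and parameter values here are placeholders; check the toolkit's documentation for the exact options):

    import graphlab as gl

    # Hypothetical input: a CSV with a single 'text' column, one document per row.
    docs = gl.SFrame.read_csv('documents.csv', header=True)

    # Convert the raw text into bag-of-words dictionaries.
    bow = gl.text_analytics.count_words(docs['text'])

    # Fit an LDA topic model; SFrames are disk-backed, so the corpus
    # does not have to fit in RAM.
    model = gl.topic_model.create(bow, num_topics=50, num_iterations=100)

    # Show the most probable words for each topic.
    print(model.get_topics())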

Upvotes: 1

Emre

Reputation: 6227

I suggest using a "big data" tool such as graphlab, which supports topic modeling: http://docs.graphlab.org/topic_modeling.html

Upvotes: 2

sinwav

Reputation: 724

There are many options available to you, and this answer doesn't attempt to compare them.

I think the important thing with such a large dataset is the method of approximate posterior inference used, not necessarily the software implementation. According to this paper, online variational Bayes inference is much more efficient, in terms of time and space, than Gibbs sampling. Though I've never used it, the gensim package looks good. It's in Python, and there are in-depth tutorials on the project's webpage.
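
To give an idea of what that looks like in practice, here is a minimal sketch of streaming a large corpus through gensim's online LDA (the file path, tokenization, and parameter values are placeholders you would adapt to your data):

    from gensim import corpora, models

    DOC_PATH = 'documents.txt'  # hypothetical: one pre-cleaned document per line

    def tokenized_docs():
        # Stream documents from disk one at a time instead of loading 30 GB at once.
        with open(DOC_PATH) as f:
            for line in f:
                yield line.lower().split()

    # Build the vocabulary in a single streaming pass over the corpus.
    dictionary = corpora.Dictionary(tokenized_docs())
    dictionary.filter_extremes(no_below=5, no_above=0.5)

    class StreamedCorpus(object):
        # Yields bag-of-words vectors lazily, so memory use stays small.
        def __iter__(self):
            for tokens in tokenized_docs():
                yield dictionary.doc2bow(tokens)

    # Online variational Bayes LDA: the model is updated chunk by chunk.
    lda = models.LdaModel(StreamedCorpus(), id2word=dictionary,
                          num_topics=100, chunksize=10000,
                          update_every=1, passes=1)

    print(lda.print_topics(10))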

For code that comes straight from the source, see the webpage of David Blei, one of the authors of the LDA model, here. He links to several implementations in a variety of languages (R, Java, C++).

Upvotes: 2
