Reputation: 1455
I'm looking for a topic modeling tool that can be applied to a large data set.
My current training data set is 30 GB. I tried MALLET topic modeling, but I always get an OutOfMemoryError.
If you have any tips, please let me know.
Upvotes: 5
Views: 3795
Reputation: 7940
The GraphLab Create topic model toolkit (with Python API bindings) should be able to handle a dataset that large.
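For illustration, a minimal sketch of what that could look like (the file name, the `text` column, and the parameter values here are my own assumptions, not part of the original suggestion):

```python
# Minimal sketch with GraphLab Create; file name, column name, and
# parameters are illustrative assumptions.
import graphlab as gl

# SFrame loads data out of core, so it can handle inputs larger than RAM.
docs = gl.SFrame.read_csv('documents.csv', header=True)

# Convert raw text into bag-of-words counts, the input the topic model expects.
bow = gl.text_analytics.count_words(docs['text'])

# Train a topic model; num_topics and num_iterations are illustrative.
model = gl.topic_model.create(bow, num_topics=50, num_iterations=100)

# Inspect the most probable words per topic.
print(model.get_topics())
```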
Upvotes: 1
Reputation: 6227
I suggest using a "big data" tool such as GraphLab, which supports topic modeling: http://docs.graphlab.org/topic_modeling.html
Upvotes: 2
Reputation: 724
There are many options available to you, and this answer doesn't attempt to compare them.
I think that with such a large dataset, the important thing is the method of approximate posterior inference, not necessarily the software implementation. According to this paper, online variational Bayes inference is much more efficient in time and space than Gibbs sampling. Though I've never used it, the gensim package looks good. It's written in Python, and there are in-depth tutorials on the project's webpage.
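As a rough sketch of gensim's online LDA on a streamed corpus (the file name, one-document-per-line layout, tokenization, and parameter values are all assumptions for illustration):

```python
# Minimal sketch, assuming one document per line in 'corpus.txt' and
# whitespace tokenization; parameter values are illustrative.
from gensim import corpora, models


def iter_docs(path):
    """Stream documents one at a time so the full corpus never sits in RAM."""
    with open(path) as f:
        for line in f:
            yield line.lower().split()


# First streaming pass: build the vocabulary.
dictionary = corpora.Dictionary(iter_docs('corpus.txt'))
dictionary.filter_extremes(no_below=5, no_above=0.5)


class BowCorpus(object):
    """Second streaming pass: convert each document to bag-of-words on the fly."""
    def __iter__(self):
        for tokens in iter_docs('corpus.txt'):
            yield dictionary.doc2bow(tokens)


# Online variational Bayes LDA: documents are processed in chunks,
# so memory use is bounded by chunksize rather than by corpus size.
lda = models.LdaModel(BowCorpus(), id2word=dictionary,
                      num_topics=100, chunksize=10000, update_every=1)

print(lda.show_topics(num_topics=5))
```

gensim also ships a multicore variant (models.LdaMulticore) if wall-clock time rather than memory is the bottleneck.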
For code that comes straight from the source, see the webpage of David Blei, one of the authors of the LDA model, here. He links to a number of implementations in a variety of languages (R, Java, C++).
Upvotes: 2