
hadoop apache-spark mapreduce apache-spark-mllib

Reputation: 6465

How to do text analysis in Spark

I'm quite familiar with Hadoop but totally new to Apache Spark. Currently I'm using the LDA (Latent Dirichlet Allocation) algorithm implemented in Mahout to do topic discovery. I need to make the process faster, so I'd like to use Spark, but the LDA (or CVB) algorithm is not implemented in Spark MLlib. Does this mean that I have to implement it from scratch myself? If so, does Spark provide any tools that make this easier?

Upvotes: 2

Views: 2722

Answers (3)

Xinh Huynh

Reputation: 151

Regarding how to use the new Spark LDA API in 1.3:

Here is an article describing the new API: Topic modeling with LDA: MLlib meets GraphX

And, it links to example code showing how to vectorize text input: Github LDA Example
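For orientation, here is a minimal sketch of what vectorizing text and running LDA might look like with the `org.apache.spark.mllib.clustering.LDA` class. It is not the code from the linked example. It assumes Spark 1.3 or later with spark-mllib on the classpath and a local master; the object name `LdaSketch`, the helpers `tokenize` and `toCountVector`, and the three toy documents are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical sketch: names and toy data are illustrative assumptions.
object LdaSketch {

  // Lowercase, split on non-letters, drop very short tokens.
  // A real pipeline would also remove stop words.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.length > 2).toSeq

  // Turn one tokenized document into a sparse term-count vector
  // indexed by the vocabulary (term -> column index).
  def toCountVector(tokens: Seq[String], vocab: Map[String, Int]): Vector = {
    val counts = tokens
      .filter(vocab.contains)
      .groupBy(vocab)
      .map { case (index, occurrences) => (index, occurrences.size.toDouble) }
    Vectors.sparse(vocab.size, counts.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LdaSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val docs = sc.parallelize(Seq(
        "Spark runs iterative algorithms in memory",
        "Mahout runs topic models on Hadoop MapReduce",
        "Topic models describe documents as mixtures of topics"))

      val tokenized = docs.map(tokenize).cache()

      // Build the vocabulary on the driver; fine for a toy corpus only.
      val vocab: Map[String, Int] =
        tokenized.flatMap(tokens => tokens).distinct().collect().sorted.zipWithIndex.toMap

      // LDA expects an RDD of (document id, term-count vector) pairs.
      val corpus = tokenized.zipWithIndex().map { case (tokens, id) =>
        (id, toCountVector(tokens, vocab))
      }

      val model = new LDA().setK(2).setMaxIterations(20).run(corpus)

      // Map term indices back to words and print the top terms per topic.
      val terms = vocab.map(_.swap)
      model.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach {
        case ((termIds, _), topic) =>
          println(s"topic $topic: " + termIds.map(terms).mkString(", "))
      }
    } finally {
      sc.stop()
    }
  }
}
```

With a corpus this small the topics are not meaningful; the point is only the shape of the input that `LDA.run` takes. For a real corpus you would probably want stop-word removal and a vocabulary size cap before building the vectors.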

Upvotes: 3

Olivier Girardot

Reputation: 4648

Actually, Spark 1.3.0 is out now, so LDA is available!

cf. https://issues.apache.org/jira/browse/SPARK-1405

Regards,

Upvotes: 3

Jean Logeart

Reputation: 53809

LDA has been added to Spark very recently. It is not part of the current 1.2.1 release.

However, you can find an example in the current SNAPSHOT version: LDAExample.scala

The SPARK-1405 issue also has useful background information.

So how can I use it?

Until it is released, the simplest way is probably to copy the following classes into your project, as if you had written them yourself:

Upvotes: 3
