Classification with Spark MLlib in Java

Question

I am trying to build a classification system with Apache Spark's MLlib. I have shortlisted Naive Bayes algorithm to do this, and will be using Java 8 for the support of Lambda expressions. I am a newbie in terms of lambda expressions and hence am facing difficulty in implementing the same in Java.

I am referring to the following link which has the sample written in Scala but am having a hard time converting it to Java 8.

http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

I am stuck on the following operation and can't get my head around it due to my unfamiliarity with Scala,

val idfs = (termDocsRdd.flatMap(termDoc => termDoc.terms.map((termDoc.doc, _))).distinct().groupBy(_._2) collect {
  // if term is present in less than 3 documents then remove it
  case (term, docs) if docs.size > 3 =>
    term -> (numDocs.toDouble / docs.size.toDouble)
}).collect.toMap

Can someone please point me the right direction about how to build TfIdf vectors for textual document samples while utilizing Sparks RDD operations for distributed processing?

Classification with Spark MLlib in Java

Answers (1)

Related Questions