Reputation: 589
I am trying to build a classification system with Apache Spark's MLlib. I have shortlisted the Naive Bayes algorithm for this, and I will be using Java 8 for its support of lambda expressions. I am a newbie when it comes to lambda expressions and hence am facing difficulty implementing them in Java.
I am referring to the following link, which has the sample written in Scala, but I am having a hard time converting it to Java 8.
I am stuck on the following operation and can't get my head around it due to my unfamiliarity with Scala:
val idfs = (termDocsRdd.flatMap(termDoc => termDoc.terms.map((termDoc.doc, _))).distinct().groupBy(_._2) collect {
  // if a term is present in 3 or fewer documents then remove it
  case (term, docs) if docs.size > 3 =>
    term -> (numDocs.toDouble / docs.size.toDouble)
}).collect.toMap
Can someone please point me in the right direction on how to build TF-IDF vectors for textual document samples while utilizing Spark's RDD operations for distributed processing?
Upvotes: 1
Views: 922
Reputation: 31513
OK, I'll explain it line by line, but it's quite easy to look up each method in the Scala API docs. Also, in the long run you will make your life much easier by sticking to Scala rather than using ultra-verbose Java.
The first line could be written
val idfs = (termDocsRdd.flatMap(termDoc => termDoc.terms.map(term => (termDoc.doc, term)))
So it's just taking each doc's terms, concatenating them all together, and pairing each one with termDoc.doc as a key.
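To make that concrete, here is a minimal sketch of just that step; TermDoc, the sample data, and sc (an existing SparkContext) are made-up stand-ins rather than anything from the linked article:

case class TermDoc(doc: String, terms: Seq[String])

val termDocsRdd = sc.parallelize(Seq(
  TermDoc("doc1", Seq("spark", "rdd", "spark")),
  TermDoc("doc2", Seq("rdd", "scala"))))

// one (doc, term) pair per term, all flattened into a single RDD
val docTermPairs = termDocsRdd.flatMap(termDoc => termDoc.terms.map(term => (termDoc.doc, term)))
// contains: ("doc1","spark"), ("doc1","rdd"), ("doc1","spark"), ("doc2","rdd"), ("doc2","scala")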
.distinct()
^^ Obvious: it just removes duplicate (doc, term) pairs.
.groupBy(_._2)
We group by the term (the second element of each pair), so now each term is a key and the value is a Seq of the (doc, term) pairs it appears in.
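Continuing the little sketch from above, those two steps would give something like:

// drop duplicate (doc, term) pairs, then key everything by the term (the 2nd tuple element)
val grouped = docTermPairs.distinct().groupBy(_._2)
// roughly: ("spark" -> Seq(("doc1","spark"))),
//          ("rdd"   -> Seq(("doc1","rdd"), ("doc2","rdd"))),
//          ("scala" -> Seq(("doc2","scala")))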
collect {
  case (term, docs) if docs.size > 3 =>
    term -> (numDocs.toDouble / docs.size.toDouble)
})
collect is a clever function that is like a filter followed by a map: we first filter by the pattern, so ... if docs.size > 3, then map to term -> (numDocs.toDouble / docs.size.toDouble).
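If it helps, the same filter-plus-map behaviour can be seen on a plain Scala collection; the counts and numDocs here are invented purely for illustration:

val docCounts = Map("spark" -> 5, "rdd" -> 2, "scala" -> 4)   // term -> number of docs containing it
val numDocs = 10

val idfs = docCounts.collect {
  case (term, count) if count > 3 =>              // the pattern acts as the filter
    term -> (numDocs.toDouble / count.toDouble)   // the right-hand side acts as the map
}
// Map("spark" -> 2.0, "scala" -> 2.5); "rdd" was dropped because count <= 3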
So we now have the term as the key and a Double as the value. Finally, the last line just turns this RDD into a regular Scala Map:
.collect.toMap
collect here is a stupid name and I think it may eventually be deprecated; toArray does the same thing and is far less confusing.
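For what it's worth, here is a minimal sketch of the two collect overloads side by side; termIdfs is just a hypothetical RDD of term -> idf pairs built from made-up values:

import org.apache.spark.rdd.RDD

val termIdfs: RDD[(String, Double)] = sc.parallelize(Seq("spark" -> 2.0, "scala" -> 2.5))

// the no-argument collect() ships the RDD's contents to the driver as an Array,
// so .collect.toMap leaves you with a plain Scala Map on the driver
val idfMap: Map[String, Double] = termIdfs.collect().toMap

// collect(PartialFunction), by contrast, stays an RDD and behaves like filter + map
val commonTerms: RDD[String] = termIdfs.collect { case (term, idf) if idf <= 2.0 => term }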
Upvotes: 2