Reputation: 1859
// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach{
case (e, i) =>
val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
in the above code snippet, i would want to extract collectedResult to reuse it for hashingTF.transform, How can this be achieved where the signature of tokenize function is
def tokenize(content: String): Seq[String] = {
...
}
Upvotes: 0
Views: 3918
Reputation: 17431
Looks like you want map
rather than foreach
. I don't understand what you're using zipWithIndex
for, nor why you're calling split
on your lines only to join them straight back up again with mkString
.
val lines: Rdd[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines = lines.map(tokenize)
val hashes = tokenizedLines.map(hashingTF)
hashes.cache()
...
Upvotes: 1