Reputation: 505
I am trying to calculate TF-IDF for documents of strings, following the TF-IDF section of the Spark MLlib feature-extraction guide: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
val sc: SparkContext = ...
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()  // default numFeatures: 2^20 = 1048576
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()  // IDF.fit makes its own pass over tf, so cache it
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
Output (each element is a sparse vector printed as (size,[indices],[values])):
Array((1048576,[1088,35436,98482,1024805],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]), (1048576,[49,34227,39165,114066,125344,240472,312955,388260,436506,469864,493361,496101,566174,747007,802226],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]),...
With this I am getting an RDD of vectors, but I am not able to get any information from these vectors about the original strings. Can anyone help me map the indices back to the original strings?
Upvotes: 1
Views: 2900
Reputation: 671
It is hard to answer your question without more information. My best guess is that you want to extract the TF-IDF value of some term for some document.
The tfidf you get on the last line is an RDD of Vector: for every document in your corpus (which is an RDD[Seq[String]]), you get back a Vector representing that document. Every term in the document has a specific TF-IDF value in this vector.
To find the position of a term in the vector and retrieve its TF-IDF value:
val position = hashingTF.indexOf("term")
Then use it to retrieve the TF-IDF value for a given document by calling the apply method on the Vector (the first document of documents in this example):
tfidf.first.apply(position)
Raw term frequencies can be extracted the same way, using tf instead of tfidf in the line above.
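For intuition: indexOf does not consult a stored vocabulary, it simply hashes the term into one of the feature buckets (2^20 by default, which is the 1048576 you see as the vector size in your output). A minimal pure-Scala sketch of that computation, mirroring what Spark 1.x's HashingTF does by default (an illustration, not the actual Spark implementation):

```scala
// Sketch of HashingTF.indexOf: fold the term's native hashCode into a
// non-negative bucket index. 1 << 20 matches Spark's default numFeatures
// (and the vector size 1048576 in the output above).
object HashingSketch {
  val numFeatures: Int = 1 << 20

  def indexOf(term: String): Int = {
    val raw = term.## % numFeatures          // may be negative on the JVM
    if (raw < 0) raw + numFeatures else raw  // fold into [0, numFeatures)
  }
}
```

Note that nothing in the bucket index lets you recover the term: "Aa" and "BB" have the same JVM hashCode (2112), so they land in the same bucket.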
Since Spark's implementation uses a hashing trick (see the documentation and the Wikipedia article on feature hashing), my understanding is that it is not possible to recover the terms from the Vector: the hash function is one-way by definition, and the "trick" may cause collisions (several terms can map to the same hash bucket).
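That said, since you still hold the original tokenized documents, a common workaround is to hash every distinct term of the corpus yourself and build a reverse index-to-terms map; collisions then show up as buckets holding more than one term. A sketch under that assumption, using plain Scala collections and an indexOf that mirrors Spark 1.x HashingTF's default (hypothetical helper names, not Spark API):

```scala
// Sketch: build a reverse map from vector index to the term(s) that hash
// there, assuming the tokenized corpus is still available.
object ReverseIndexSketch {
  val numFeatures: Int = 1 << 20

  // Mirrors HashingTF's default bucketing (native hashCode mod 2^20).
  def indexOf(term: String): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // index -> all distinct corpus terms that land in that bucket
  def reverseMap(corpus: Seq[Seq[String]]): Map[Int, Set[String]] =
    corpus.flatten.toSet.groupBy(indexOf)
}
```

In Spark you would do the equivalent over documents (flatMap, distinct, then map each term to (indexOf(term), term) and collect). Buckets whose set has more than one term are exactly the collisions the hashing trick allows, so the mapping is to candidate terms, not a unique original string.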
Upvotes: 3