Reputation: 505
I am trying to calculate TF-IDF for documents of strings, following the TF-IDF section of the Spark MLlib feature-extraction guide: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
val sc: SparkContext = ...
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()  // default numFeatures: 2^20 = 1048576
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()  // IDF.fit makes its own pass over tf, so cache it
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
Output (each element is a sparse vector printed as (size,[indices],[values])):
Array((1048576,[1088,35436,98482,1024805],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]), (1048576,[49,34227,39165,114066,125344,240472,312955,388260,436506,469864,493361,496101,566174,747007,802226],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]),...
With this I am getting an RDD of vectors, but I am not able to get any information from these vectors about the original strings. Can anyone help me map the indices back to the original strings?
Upvotes: 1
Views: 2900
Reputation: 671
It is hard to answer your question without more information. My best guess is that you want to extract the TF-IDF value of some term for some document.
The tfidf you get on the last line is an RDD of Vector: for every document in your corpus (which is an RDD[Seq[String]]), you get back a Vector representing that document. Every term in the document has a specific TF-IDF value in this vector.
To find the position of a term in the vector and retrieve its TF-IDF value:
val position = hashingTF.indexOf("term")
Then use it to retrieve the TF-IDF value for a given document by calling the apply method on the Vector (the first document of documents in this example):
tfidf.first.apply(position)
Raw term frequencies can be extracted the same way, using tf instead of tfidf in the line above.
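For intuition: indexOf does not consult a stored vocabulary, it simply hashes the term into one of the feature buckets (2^20 by default, which is the 1048576 you see as the vector size in your output). A minimal pure-Scala sketch of that computation, mirroring what Spark 1.x's HashingTF does by default (an illustration, not the actual Spark implementation):

```scala
// Sketch of HashingTF.indexOf: fold the term's native hashCode into a
// non-negative bucket index. 1 << 20 matches Spark's default numFeatures
// (and the vector size 1048576 in the output above).
object HashingSketch {
  val numFeatures: Int = 1 << 20

  def indexOf(term: String): Int = {
    val raw = term.## % numFeatures          // may be negative on the JVM
    if (raw < 0) raw + numFeatures else raw  // fold into [0, numFeatures)
  }
}
```

Note that nothing in the bucket index lets you recover the term: "Aa" and "BB" have the same JVM hashCode (2112), so they land in the same bucket.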
Since Spark's implementation uses a hashing trick (see the documentation and the Wikipedia article on feature hashing), my understanding is that it is not possible to recover the terms from the Vector: the hash function is one-way by definition, and the "trick" may cause collisions (several terms can map to the same hash bucket).
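That said, since you still hold the original tokenized documents, a common workaround is to hash every distinct term of the corpus yourself and build a reverse index-to-terms map; collisions then show up as buckets holding more than one term. A sketch under that assumption, using plain Scala collections and an indexOf that mirrors Spark 1.x HashingTF's default (hypothetical helper names, not Spark API):

```scala
// Sketch: build a reverse map from vector index to the term(s) that hash
// there, assuming the tokenized corpus is still available.
object ReverseIndexSketch {
  val numFeatures: Int = 1 << 20

  // Mirrors HashingTF's default bucketing (native hashCode mod 2^20).
  def indexOf(term: String): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // index -> all distinct corpus terms that land in that bucket
  def reverseMap(corpus: Seq[Seq[String]]): Map[Int, Set[String]] =
    corpus.flatten.toSet.groupBy(indexOf)
}
```

In Spark you would do the equivalent over documents (flatMap, distinct, then map each term to (indexOf(term), term) and collect). Buckets whose set has more than one term are exactly the collisions the hashing trick allows, so the mapping is to candidate terms, not a unique original string.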
Upvotes: 3