Reputation: 733
I am using Spark MLlib for one of my projects, in which I need to calculate document similarities.
I first converted the documents to vectors using MLlib's TF-IDF transform, then converted the result into a RowMatrix and used the columnSimilarities() method.
I referred to the TF-IDF documentation and used the DIMSUM implementation for cosine similarities.
This is the Scala code I ran in spark-shell:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix
Now let's say my input file, test1, in this code block is a simple file with 5 short documents (less than 10 terms each), one on each row.
Since I am just testing this code, I would like to see the output of mat.columnSimilarities(), which is in the object sim.
I would like to see the similarity of 1st document vector with 2nd, 3rd and so on.
I referred to the Spark documentation for CoordinateMatrix, which is the type of object returned by the columnSimilarities method of the RowMatrix class and referred to by sim.
By going through more documentation, I figured I could convert the CoordinateMatrix to a RowMatrix, then convert the rows of the RowMatrix to arrays and print them like this: println(sim.toRowMatrix().rows.toArray().mkString("\n"))
But that gives some output which I couldn't understand.
Can anyone help? Any kind of resource links etc would help a lot!
Thanks!
Upvotes: 2
Views: 2951
Reputation: 4333
You can try the following; there is no need to convert to row matrix format:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val transformedRDD = sim.entries.map { case MatrixEntry(row, col, value) => s"$row,$col,$value" }
To retrieve the elements, you can invoke the following action:
transformedRDD.collect()
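One caveat worth checking: columnSimilarities() compares the columns of the matrix. In the question's RowMatrix each row is a document and each column is a hashed term, so sim holds term-to-term similarities rather than document-to-document ones. Below is a rough sketch of one way to get document similarities instead, by putting each document in its own column first. It assumes the tfidf RDD from the question is still defined in the same spark-shell session and that your Spark version has Vector.toSparse; the names entries and docSim are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Assumption: tfidf is the RDD[Vector] built in the question, one vector per document.
// Build a (term x document) matrix so that every document becomes a column.
val entries = tfidf.zipWithIndex.flatMap { case (vec, docId) =>
  val sv = vec.toSparse
  sv.indices.zip(sv.values).map { case (termId, weight) =>
    MatrixEntry(termId.toLong, docId, weight)
  }
}

// Column similarities of this matrix are cosine similarities between documents.
val docSim = new CoordinateMatrix(entries).toRowMatrix().columnSimilarities()

// Only pairs with i < j are returned (the upper triangle); indices are 0-based.
docSim.entries.collect().foreach { case MatrixEntry(i, j, value) =>
  println(s"doc $i vs doc $j: $value")
}
```

Document indices follow the order given by zipWithIndex, which for a small text file read with sc.textFile should match the line order. Pairs that never appear in the output have similarity 0.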
Upvotes: 4