pan8863

Reputation: 733

Print CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

I am using Spark MLlib for a project in which I need to calculate document similarities.

I first converted the documents to vectors using MLlib's TF-IDF transform, then built a RowMatrix from those vectors and called its columnSimilarities() method.

I followed the TF-IDF documentation and used the DIMSUM implementation for cosine similarities.

This is the Scala code I ran in spark-shell:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()

val tf = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm

val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix

Now let's say my input file, test1 in this code block, is a simple file with 5 short documents (fewer than 10 terms each), one per line.

Since I am just testing this code, I would like to see the output of mat.columnSimilarities(), which is stored in sim. In particular, I would like to see the similarity of the 1st document vector with the 2nd, 3rd, and so on.

I referred to the Spark documentation for CoordinateMatrix, which is the type returned by the columnSimilarities method of the RowMatrix class and therefore the type of sim.

By going through more documentation, I figured I could convert the CoordinateMatrix to a RowMatrix, convert its rows to an array, and print them like this: println(sim.toRowMatrix().rows.toArray().mkString("\n"))

But that gives output that I couldn't understand.

Can anyone help? Any kind of resource links etc would help a lot!

Thanks!

Upvotes: 2

Views: 2951

Answers (1)

tourist

Reputation: 4333

You can try the following; there is no need to convert to a RowMatrix:

import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val transformedRDD = sim.entries.map { case MatrixEntry(row: Long, col: Long, value: Double) => Array(row, col, value).mkString(",") }

To retrieve the elements, invoke the following action:

transformedRDD.collect()
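
To make the output easier to read, here is a rough sketch that continues the question's spark-shell session (it assumes `sim` and `tfidf` are still defined there). Each MatrixEntry has the fields i, j and value, where value is the cosine similarity between column i and column j of the matrix on which columnSimilarities() was called. Note that columnSimilarities() compares columns, and in the question the documents are the rows of the RowMatrix, so `sim` most likely holds similarities between hashed term features rather than between documents. The second half shows one possible way to get document-to-document similarities by transposing first; it assumes a Spark version in which CoordinateMatrix.transpose() is available (1.3 or later, as far as I know). The names `indexedRows`, `docsAsColumns` and `docSim` are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Assumption: `sim` and `tfidf` come from the question's spark-shell session.

// 1) Print the entries of `sim`, sorted by (i, j).
//    collect() is only reasonable here because the test input is tiny.
val entries = sim.entries.collect().sortBy(e => (e.i, e.j))
entries.foreach(e => println(f"${e.i}%d,${e.j}%d -> ${e.value}%.4f"))

// 2) Documents are rows of `tfidf`, so transpose to make each document a column,
//    then compute column similarities on the transposed matrix.
val indexedRows = tfidf.zipWithIndex.map { case (vec, idx) => IndexedRow(idx, vec) }
val docsAsColumns = new IndexedRowMatrix(indexedRows)
  .toCoordinateMatrix()
  .transpose()
  .toRowMatrix()
val docSim = docsAsColumns.columnSimilarities()

// Here i and j are document indices (line numbers in the input file, starting at 0).
val docEntries = docSim.entries.collect().sortBy(e => (e.i, e.j))
docEntries.foreach(e => println(f"doc ${e.i}%d vs doc ${e.j}%d: ${e.value}%.4f"))
```

Only pairs with i < j should show up, since the result is an upper-triangular matrix, and pairs with no shared terms may be missing entirely because the matrix is sparse.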

Upvotes: 4
