How to use the output of RowMatrix.columnSimilarities

Question

I need to compute similarities between the columns of a matrix, and I tried the columnSimilarities() method to get the results.

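import java.util.LinkedList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {

    SparkConf sparkConf = new SparkConf().setAppName("CollarberativeFilter").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    SparkSession spark = SparkSession.builder().appName("CollarberativeFilter").getOrCreate();
    double[][] array = {{5,0,5}, {0,10,0}, {5,0,5}};
    LinkedList<Vector> rowsList = new LinkedList<>();
    for (int i = 0; i < array.length; i++) {
        Vector currentRow = Vectors.dense(array[i]);
        rowsList.add(currentRow);
    }
    JavaRDD<Vector> rows = sc.parallelize(rowsList);

    // Create a RowMatrix from JavaRDD.
    RowMatrix mat = new RowMatrix(rows.rdd());
    CoordinateMatrix simsPerfect = mat.columnSimilarities();
    RowMatrix mat2 = simsPerfect.toRowMatrix();
    List<Vector> vs2 = mat2.rows().toJavaRDD().collect();
    List<Vector> vs = mat.rows().toJavaRDD().collect();
    System.out.println("mat");
    for (Vector v : vs) {
        System.out.println(v);
    }
    System.out.println("mat2");
    for (Vector v : vs2) {
        System.out.println(v);
    }
    // Write each similarity entry as "i,j,value".
    JavaRDD<MatrixEntry> entries = simsPerfect.entries().toJavaRDD();
    JavaRDD<String> output = entries.map(new Function<MatrixEntry, String>() {
        public String call(MatrixEntry e) {
            return String.format("%d,%d,%s", e.i(), e.j(), e.value());
        }
    });
    output.saveAsTextFile("resources123/data.txt");

}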

But the output in the text file was:

0,2,0.9999999999999998

Next I tried the same example with double[][] array = {{1,3}, {2,7}}; and the output in the text file was:

0,1,0.9982743731749959

Can someone explain the output format to me? Can't I get a score for every column pair of the matrix? For example, for a 3-by-3 matrix I need 3 similarity scores: between columns 1 and 2, columns 2 and 3, and columns 3 and 1. Any help appreciated.

evan.oman · Accepted Answer

Column similarity is computed as the cosine similarity, defined as follows for two columns A and B:

similarity(A, B) = (A · B) / (||A|| ||B||)
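The definition can be checked by hand without Spark. The sketch below is plain Java, and the names ColumnCosine and cosine are made up for illustration. It divides the dot product of two columns by the product of their Euclidean norms, and it treats an all-zero column as having similarity 0, which is an assumption made here only to avoid dividing by zero.

```java
public class ColumnCosine {

    // Cosine similarity between columns a and b of a dense matrix stored as rows.
    static double cosine(double[][] m, int a, int b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (double[] row : m) {
            dot += row[a] * row[b];
            normA += row[a] * row[a];
            normB += row[b] * row[b];
        }
        // Assumption for this sketch: a zero column gets similarity 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // The 3x3 matrix from the question.
        double[][] array = {{5, 0, 5}, {0, 10, 0}, {5, 0, 5}};
        int n = array[0].length;
        // Print one line per column pair (i < j), in the same "i,j,value" layout.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                System.out.println(i + "," + j + "," + cosine(array, i, j));
            }
        }
    }
}
```

For the question's matrix, columns 0 and 2 are identical, so their similarity is 1 up to floating-point rounding, while column 1 shares no nonzero row with the other two, so both of its pairs come out as 0.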

Since you included the scala tag I am going to cheat and repeat what you did in the Scala REPL:

scala> import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.linalg.{Vectors, Vector}

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

scala> val matVec = Vector(Vectors.dense(5,0,5), Vectors.dense(0,10,0), Vectors.dense(5,0,5))
matVec: scala.collection.immutable.Vector[org.apache.spark.mllib.linalg.Vector] = Vector([5.0,0.0,5.0], [0.0,10.0,0.0], [5.0,0.0,5.0])

scala> val matRDD = sc.parallelize(matVec)
matRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[44] at parallelize at <console>:37

scala> val myRowMat = new RowMatrix(matRDD)
myRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@7a7a07c2

scala> myRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,2,0.9999999999999998)

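Each MatrixEntry(i, j, value) holds the similarity between column i and column j of the input matrix. This output means that there was only one nonzero entry, at (row 0, col 2), i.e. between column 0 and column 2. Thus the actual (upper triangular) output was: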

0    0    .9999
0    0    0
0    0    0

This is what you would expect, since the dot product between col0 and col1 is zero and the dot product between col1 and col2 is zero.
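If a score is needed for every column pair, the pairs missing from the output can be read as zeros. The following plain-Java sketch (no Spark; DenseSimilarities and toDense are hypothetical names) parses lines in the "i,j,value" format written by the question's code and fills a full symmetric matrix. It assumes the number of columns n is known and that every column is nonzero, so the diagonal is set to 1.

```java
import java.util.Arrays;
import java.util.List;

public class DenseSimilarities {

    // Build a full n x n similarity matrix from "i,j,value" lines.
    // Pairs that do not appear in the input keep the default value 0.0.
    static double[][] toDense(List<String> lines, int n) {
        double[][] sims = new double[n][n];
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            int i = Integer.parseInt(parts[0]);
            int j = Integer.parseInt(parts[1]);
            double value = Double.parseDouble(parts[2]);
            sims[i][j] = value;
            sims[j][i] = value; // cosine similarity is symmetric
        }
        // Assumption for this sketch: every column is nonzero,
        // so each column's similarity with itself is 1.
        for (int k = 0; k < n; k++) {
            sims[k][k] = 1.0;
        }
        return sims;
    }

    public static void main(String[] args) {
        // The single line produced for the question's 3x3 matrix.
        List<String> lines = Arrays.asList("0,2,0.9999999999999998");
        double[][] sims = toDense(lines, 3);
        for (double[] row : sims) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With the matrix filled in this way, sims[i][j] can be looked up for any pair, including the pairs (0,1) and (1,2) that the output omitted.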

Here is an example with a less sparse column similarities matrix:

scala> import scala.util.Random
import scala.util.Random

scala> def randVec(len: Int) : org.apache.spark.mllib.linalg.Vector =
     | Vectors.dense(Array.fill(len)(Random.nextDouble))
randVec: (len: Int)org.apache.spark.mllib.linalg.Vector

scala> val randRDD = sc.parallelize(Seq.fill(3)(randVec(4)))
randRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[123] at parallelize at <console>:38

scala> val randRowMat = new RowMatrix(randRDD)
randRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@77d9112e

scala> randRowMat.rows.collect.foreach{println}
[0.11049508671100228,0.6560383649078886,0.08647831963379027,0.918734774579884]
[0.5709766390994561,0.5404121150599919,0.8206115742925799,0.12848224469499103]
[0.5414651842028494,0.26273347471310016,0.3139446375461201,0.351113866208812]

scala> randRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,3,0.4630854334046888)
MatrixEntry(0,2,0.9238294198864545)
MatrixEntry(2,3,0.33700154742702093)
MatrixEntry(0,1,0.7402725425024911)
MatrixEntry(1,2,0.7418690274112878)
MatrixEntry(1,3,0.8662504236158493)

Which represents the following matrix:

0       0.74027     0.92382     0.46308
0       0           0.74186     0.86625
0       0           0           0.33700
0       0           0           0
