Reputation: 440
I need to compute similarities between the columns of a matrix and tried the columnSimilarities() method to get results.
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("CollarberativeFilter").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    SparkSession spark = SparkSession.builder().appName("CollarberativeFilter").getOrCreate();
    double[][] array = {{5,0,5}, {0,10,0}, {5,0,5}};
    LinkedList<Vector> rowsList = new LinkedList<Vector>();
    for (int i = 0; i < array.length; i++) {
        Vector currentRow = Vectors.dense(array[i]);
        rowsList.add(currentRow);
    }
    JavaRDD<Vector> rows = sc.parallelize(rowsList);
    // Create a RowMatrix from JavaRDD<Vector>.
    RowMatrix mat = new RowMatrix(rows.rdd());
    CoordinateMatrix simsPerfect = mat.columnSimilarities();
    RowMatrix mat2 = simsPerfect.toRowMatrix();
    List<Vector> vs2 = mat2.rows().toJavaRDD().collect();
    List<Vector> vs = mat.rows().toJavaRDD().collect();
    System.out.println("mat");
    for (Vector v : vs) {
        System.out.println(v);
    }
    System.out.println("mat2");
    for (Vector v : vs2) {
        System.out.println(v);
    }
    JavaRDD<MatrixEntry> entries = simsPerfect.entries().toJavaRDD();
    JavaRDD<String> output = entries.map(new Function<MatrixEntry, String>() {
        public String call(MatrixEntry e) {
            return String.format("%d,%d,%s", e.i(), e.j(), e.value());
        }
    });
    output.saveAsTextFile("resources123/data.txt");
}
But the output in the text file was 0,2,0.9999999999999998.
Next I tried the same example using double[][] array = {{1,3}, {2,7}};
Then the output in the text file was 0,1,0.9982743731749959
Can someone explain the output format? Can't I get a score for each and every column pair of the matrix? For example, for a 3-by-3 matrix I need 3 similarity scores: between columns 1 and 2, columns 2 and 3, and columns 3 and 1. Any help is appreciated.
Upvotes: 3
Views: 2457
Reputation: 5572
Column similarity is computed as the cosine similarity, defined for two columns A and B as follows:
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
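As a worked check of cosine similarity on the second matrix from the question, {{1,3}, {2,7}}, whose columns are (1,2) and (3,7), the arithmetic would be roughly:

$$\cos(\theta) = \frac{1\cdot 3 + 2\cdot 7}{\sqrt{1^2 + 2^2}\,\sqrt{3^2 + 7^2}} = \frac{17}{\sqrt{5}\,\sqrt{58}} = \frac{17}{\sqrt{290}} \approx 0.99827$$

which agrees with the reported entry 0,1,0.9982743731749959.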
Since you included the scala tag, I am going to cheat and repeat what you did in the Scala REPL:
scala> import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.linalg.{Vectors, Vector}
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val matVec = Vector(Vectors.dense(5,0,5), Vectors.dense(0,10,0), Vectors.dense(5,0,5))
matVec: scala.collection.immutable.Vector[org.apache.spark.mllib.linalg.Vector] = Vector([5.0,0.0,5.0], [0.0,10.0,0.0], [5.0,0.0,5.0])
scala> val matRDD = sc.parallelize(matVec)
matRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[44] at parallelize at <console>:37
scala> val myRowMat = new RowMatrix(matRDD)
myRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@7a7a07c2
scala> myRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,2,0.9999999999999998)
This output means that there was only one nonzero entry, at (row 0, col 2), which is the similarity between col0 and col2. Thus the actual (upper triangular) output was:
0 0 .9999
0 0 0
0 0 0
Which is what you would expect, since the dot product between col0 and col1 is zero and the dot product between col1 and col2 is zero.
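To see where the zero entries come from, here is a minimal plain-Java sketch (no Spark involved) that computes the cosine similarity for every column pair of the 3x3 matrix from the question. The class name ColumnCosine and the helper cosine are made up for illustration, and treating an all-zero column as similarity 0 is an assumption of this sketch, not something taken from Spark.

```java
// Illustrative sketch only: brute-force cosine similarity between matrix columns.
class ColumnCosine {

    // Cosine similarity between columns a and b of a dense, row-major matrix.
    static double cosine(double[][] m, int a, int b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (double[] row : m) {
            dot += row[a] * row[b];
            normA += row[a] * row[a];
            normB += row[b] * row[b];
        }
        // Assumption of this sketch: an all-zero column gets similarity 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[][] m = {{5, 0, 5}, {0, 10, 0}, {5, 0, 5}};
        int cols = m[0].length;
        // Print one line per column pair (i < j), including the zero-valued pairs.
        for (int i = 0; i < cols; i++) {
            for (int j = i + 1; j < cols; j++) {
                System.out.println(i + "," + j + "," + cosine(m, i, j));
            }
        }
    }
}
```

The pairs (0,1) and (1,2) come out as exactly zero here, which is consistent with them being absent from the sparse columnSimilarities output.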
Here is an example with a less sparse column similarities matrix:
scala> import scala.util.Random
import scala.util.Random
scala> def randVec(len: Int) : org.apache.spark.mllib.linalg.Vector =
| Vectors.dense(Array.fill(len)(Random.nextDouble))
randVec: (len: Int)org.apache.spark.mllib.linalg.Vector
scala> val randRDD = sc.parallelize(Seq.fill(3)(randVec(4)))
randRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[123] at parallelize at <console>:38
scala> val randRowMat = new RowMatrix(randRDD)
randRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@77d9112e
scala> randRowMat.rows.collect.foreach{println}
[0.11049508671100228,0.6560383649078886,0.08647831963379027,0.918734774579884]
[0.5709766390994561,0.5404121150599919,0.8206115742925799,0.12848224469499103]
[0.5414651842028494,0.26273347471310016,0.3139446375461201,0.351113866208812]
scala> randRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,3,0.4630854334046888)
MatrixEntry(0,2,0.9238294198864545)
MatrixEntry(2,3,0.33700154742702093)
MatrixEntry(0,1,0.7402725425024911)
MatrixEntry(1,2,0.7418690274112878)
MatrixEntry(1,3,0.8662504236158493)
Which represents the following matrix:
0 0.74027 0.92382 0.46308
0 0 0.74186 0.86625
0 0 0 0.33700
0 0 0 0
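If you want a score for every column pair, one option is to expand the sparse entries into a full matrix yourself and treat any missing pair as 0. Below is a rough plain-Java sketch of that idea. The Entry class is a hypothetical stand-in for Spark's MatrixEntry (in real code you would read e.i(), e.j() and e.value() from the collected entries), and setting the diagonal to 1.0 is an assumption that only makes sense for nonzero columns.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: expand sparse (i, j, value) entries into a dense symmetric matrix.
class DenseSimilarity {

    // Hypothetical stand-in for org.apache.spark.mllib.linalg.distributed.MatrixEntry.
    static final class Entry {
        final int i;
        final int j;
        final double value;

        Entry(int i, int j, double value) {
            this.i = i;
            this.j = j;
            this.value = value;
        }
    }

    // n is the number of columns of the original matrix.
    static double[][] toDense(List<Entry> entries, int n) {
        double[][] sim = new double[n][n]; // pairs with no entry stay at 0.0
        for (int k = 0; k < n; k++) {
            sim[k][k] = 1.0; // assumption: each column is nonzero, so self-similarity is 1
        }
        for (Entry e : entries) {
            sim[e.i][e.j] = e.value; // upper triangle, as in the sparse output
            sim[e.j][e.i] = e.value; // mirror, because cosine similarity is symmetric
        }
        return sim;
    }

    public static void main(String[] args) {
        // The single entry reported for the 3x3 example.
        List<Entry> entries = Arrays.asList(new Entry(0, 2, 0.9999999999999998));
        for (double[] row : toDense(entries, 3)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

This keeps the Spark result sparse and only densifies on the driver, so it is only reasonable when the number of columns is small enough for an n-by-n array to fit in memory.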
Upvotes: 3