How to calculate dissimilarity matrix in Spark?

Question

Is there any function or method that calculates dissimilarity matrix for a given data set? I've found All-pairs similarity via DIMSUM but it looks like it works for sparse data only. Mine is really dense.

Mateusz Dymczyk · Accepted Answer

The original DIMSUM paper does talk about a matrix in which:

each dimension is sparse with at most L nonzeros per row

and whose values satisfy:

the entries of A have been scaled to be in [−1, 1]

Neither of these is a requirement, though, and you can run it on a dense matrix. In fact, if you check the sample code by the DIMSUM author on the Databricks blog, you'll notice that the RowMatrix is created from an RDD of dense vectors:

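import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Load and parse the data file: one space-separated row of doubles per line.
// `sc` is the SparkContext and `filename` the path to the input file.
val rows = sc.textFile(filename).map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.dense(values)
}
val mat = new RowMatrix(rows)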

Similarly, the comment in Spark's CosineSimilarity example shows a dense, unscaled matrix as input.
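For illustration, here is a minimal sketch of how both overloads of columnSimilarities() could be called on a RowMatrix such as mat above. The object name, the helper name columnSims and the default threshold of 0.1 are assumptions made for this example, not something taken from the blog post.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

object ColumnSimilaritySketch {
  // Returns (exact, approximate) cosine similarities between the columns of `mat`.
  // `threshold` is a hypothetical tuning value; higher values mean more
  // aggressive sampling and therefore less work.
  def columnSims(mat: RowMatrix, threshold: Double = 0.1): (CoordinateMatrix, CoordinateMatrix) = {
    // No-argument overload: brute-force, exact cosine similarities.
    val exact = mat.columnSimilarities()
    // Threshold overload: DIMSUM sampling, which gives estimates that are
    // meant to be reliable for pairs whose similarity is above the threshold.
    val approx = mat.columnSimilarities(threshold)
    (exact, approx)
  }
}
```

Both results are upper-triangular CoordinateMatrix instances, so each column pair (i, j) with i < j appears at most once.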

You need to be aware that the only available method is columnSimilarities(), which calculates similarities between columns. Hence, if your input data file is structured so that each record is a row, you will have to transpose the matrix first and then run the similarity. As for transposing: RowMatrix has no transpose method, but other matrix types in MLlib (CoordinateMatrix, for example) do, so you would have to convert your data first.
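One way to do that conversion is to go through CoordinateMatrix, which does have transpose(). The following is only a sketch under assumptions: the object and method names are made up for this example, and cosine dissimilarity is taken to be 1 minus the cosine similarity, which may not be the measure you want.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

object RowDissimilaritySketch {
  // `rows` holds one record per element. Returns cosine dissimilarities
  // (1 - cosine similarity) between records, upper triangle only (i < j).
  def rowDissimilarities(rows: RDD[Vector]): RDD[MatrixEntry] = {
    // Tag each record with its index and emit one entry per nonzero cell.
    // Zeros can be skipped: they add nothing to dot products or norms.
    val entries = rows.zipWithIndex().flatMap { case (vec, i) =>
      vec.toArray.zipWithIndex.collect {
        case (value, j) if value != 0.0 => MatrixEntry(i, j.toLong, value)
      }
    }
    // After the transpose, every original record is a column.
    val transposed = new CoordinateMatrix(entries).transpose().toRowMatrix()
    // Turn each similarity into a dissimilarity.
    transposed.columnSimilarities().entries
      .map(e => MatrixEntry(e.i, e.j, 1.0 - e.value))
  }
}
```

Two caveats, as far as I can tell: pairs of records that share no nonzero dimension do not show up in the output and should be read as dissimilarity 1, and the number of records becomes the column count of the transposed matrix, so this is probably only practical when that count is moderate.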

Row similarity is in the works, but unfortunately it did not make it into Spark 1.5, the newest release.

As for other options, you would have to implement them yourself. The naive brute-force solution, which requires O(mL^2) shuffles, is very easy to implement (cartesian + your similarity measure of choice) but performs very badly (speaking from experience).
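A minimal sketch of that brute-force approach, assuming Euclidean distance as the dissimilarity measure (swap in whatever measure you need); the object and method names are hypothetical:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.rdd.RDD

object BruteForceDissimilaritySketch {
  // Exact pairwise Euclidean distances between records,
  // upper triangle only (i < j).
  def pairwiseDistances(rows: RDD[Vector]): RDD[MatrixEntry] = {
    // Give every record a stable index so that pairs can be identified.
    // The RDD is cached because cartesian() reads it on both sides.
    val indexed = rows.zipWithIndex().map { case (vec, i) => (i, vec) }.cache()
    indexed.cartesian(indexed)
      // Keep each unordered pair once and skip self-pairs.
      .filter { case ((i, _), (j, _)) => i < j }
      .map { case ((i, a), (j, b)) =>
        MatrixEntry(i, j, math.sqrt(Vectors.sqdist(a, b)))
      }
  }
}
```

The result is exact, but cartesian() materialises every pair of records, so expect it to scale poorly, as noted above.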

You can also have a look at a different algorithm by the same author, called DISCO, but it is not implemented in Spark (and the paper also assumes L-sparsity).

Finally, be advised that both DIMSUM and DISCO produce estimates (although extremely good ones).
