Tasos

Reputation: 7587

Cosine similarity of two sparse vectors in Scala Spark

I have a dataframe with two columns, where each row holds a SparseVector. I'm trying to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.

However, I haven't been able to find any library or tutorial that does this for sparse vectors.

The only way I found is the following:

  1. Create a k x n matrix, where the n items are described as k-dimensional vectors. To represent each item as a k-dimensional vector, you can use ALS, which represents each entity in a latent factor space. You choose the dimension of this space (k). This k x n matrix can be represented as an RDD[Vector].

  2. Convert this k x n matrix to a RowMatrix.

  3. Use the columnSimilarities() function to get an n x n matrix of similarities between the n items.

I feel it is overkill to calculate all the cosine similarities for every pair when I only need them for specific pairs in my (quite big) dataframe.
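
For reference, a minimal sketch of steps 2 and 3 (assuming a SparkContext sc and toy dense vectors standing in for the ALS item factors):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// each row is one of the k latent dimensions; the n items are the columns
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0)
))

val mat = new RowMatrix(rows)        // step 2
val sims = mat.columnSimilarities()  // step 3: n x n upper-triangular CoordinateMatrix
sims.entries.collect().foreach(println)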

Upvotes: 2

Views: 2160

Answers (3)

belz

Reputation: 47

Great answer above, @Sergey-Zakharov, +1. A few add-ons:

  1. The reduce doesn't work on empty sequences, so use reduceOption instead.
  2. Make sure to compute the L2 normalization.
import org.apache.spark.ml.feature.Normalizer

// L2-normalize the feature vectors so that the dot product equals cosine similarity
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)

val l2NormData = normalizer.transform(df_features)

and

import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.udf

val dotProduct = udf { (v1: SparseVector, v2: SparseVector) =>
    v1.indices.intersect(v2.indices).map(x => v1(x) * v2(x)).reduceOption(_ + _).getOrElse(0.0)
}

and then

import org.apache.spark.sql.functions.{broadcast, col}

val df = dfA.crossJoin(broadcast(dfB))
    .withColumn("dot", dotProduct(col("featuresA"), col("featuresB")))

Upvotes: 1

Sergey Zakharov

Reputation: 1615

In Spark 3 there is now a dot method on SparseVector, which takes another vector as its argument.
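
For example (a quick sketch with made-up values):

import org.apache.spark.ml.linalg.Vectors

val v1 = Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0))
val v2 = Vectors.sparse(4, Array(2, 3), Array(2.0, 5.0))

v1.dot(v2)  // 6.0 -- only the shared index 2 contributes: 3.0 * 2.0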

If you want to do this in earlier versions, you could create a user defined function that follows this algorithm:

  • Take the intersection of your vectors' indices.
  • Get two subarrays of your vectors' values based on the indices from the intersection.
  • Do pairwise multiplication of the elements of those two subarrays.
  • Sum the resulting values of the pairwise multiplication.

Here's my implementation of it:

import org.apache.spark.ml.linalg.SparseVector

def dotProduct(vec: SparseVector, vecOther: SparseVector): Double = {
    // indices present in both vectors; assumes the intersection is non-empty
    val commonIndices = vec.indices intersect vecOther.indices
    commonIndices.map(x => vec(x) * vecOther(x)).reduce(_ + _)
}

I guess you know how to turn it into a Spark UDF from here and apply it to your dataframe's columns.
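
For completeness, a minimal sketch of that step (assuming your dataframe is df and the vector columns are called "vecA" and "vecB"):

import org.apache.spark.sql.functions.{col, udf}

val dotProductUdf = udf { (v1: SparseVector, v2: SparseVector) => dotProduct(v1, v2) }

val withDot = df.withColumn("dot", dotProductUdf(col("vecA"), col("vecB")))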

And if you normalize your sparse vectors with org.apache.spark.ml.feature.Normalizer before computing your dot product, you'll get cosine similarity in the end (by definition).
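
Alternatively (a rough sketch), you can skip the Normalizer and divide the dot product by the L2 norms directly with Vectors.norm:

import org.apache.spark.ml.linalg.{SparseVector, Vectors}

def cosineSimilarity(vec: SparseVector, vecOther: SparseVector): Double = {
  val norms = Vectors.norm(vec, 2.0) * Vectors.norm(vecOther, 2.0)
  if (norms == 0.0) 0.0 else dotProduct(vec, vecOther) / norms
}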

Upvotes: 1

Carlos Bribiescas

Reputation: 4427

If the number of vectors you want to calculate the dot product with is small, cache the RDD[Vector] table. Create a new table, [cosine_vectors], by filtering the original table down to only the vectors you want the cosine similarities for. Broadcast join those two together and calculate, as sketched below.
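
A rough DataFrame sketch of that idea (the table, id, and column names are placeholders; dotProduct is the UDF from the answers above):

import org.apache.spark.sql.functions.{broadcast, col}

// cache the full table of vectors, then filter out just the ones of interest
val cached = vectors.cache()
val wanted = cached
  .filter(col("id").isin(wantedIds: _*))
  .withColumnRenamed("id", "idB")
  .withColumnRenamed("features", "featuresB")

// broadcast the small filtered side and compute the dot product per pair
val sims = cached.crossJoin(broadcast(wanted))
  .withColumn("dot", dotProduct(col("features"), col("featuresB")))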

Upvotes: 0
