Reputation: 7587
I have a dataframe with two columns where each row holds a SparseVector. I am trying to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.
However, I haven't been able to find any library or tutorial that does this for sparse vectors.
The only way I have found is the following:
1. Create a k x n matrix, where the n items are described as k-dimensional vectors. To represent each item as a k-dimensional vector, you can use ALS, which represents each entity in a latent factor space. The dimension of this space (k) can be chosen by you. This k x n matrix can be represented as an RDD[Vector].
2. Convert this k x n matrix to a RowMatrix.
3. Use the columnSimilarities() function to get an n x n matrix of similarities between the n items.
I feel it is overkill to calculate all the cosine similarities for every pair when I need them only for the specific pairs in my (quite big) dataframe.
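For reference, here is a minimal sketch of that columnSimilarities() route (assuming a SparkContext sc; the RDD contents are made up purely for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// k rows, each an n-dimensional vector; the n items are the columns
val rows = sc.parallelize(Seq(
  Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0)),
  Vectors.sparse(4, Array(1, 2), Array(3.0, 4.0))
))

val mat = new RowMatrix(rows)
// upper-triangular CoordinateMatrix holding the cosine similarity of every pair of columns
val similarities = mat.columnSimilarities()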
Upvotes: 2
Views: 2160
Reputation: 47
Great answer above from @Sergey-Zakharov, +1. A few add-ons:
import org.apache.spark.ml.feature.Normalizer

// L2-normalize the sparse vectors so that a plain dot product of the
// normalized columns gives cosine similarity
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
val l2NormData = normalizer.transform(df_features)
and
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.udf

val dotProduct = udf { (v1: SparseVector, v2: SparseVector) =>
  // dot product over the shared indices; reduceOption covers the no-overlap case
  v1.indices.intersect(v2.indices).map(x => v1(x) * v2(x)).reduceOption(_ + _).getOrElse(0.0)
}
and then
import org.apache.spark.sql.functions.{broadcast, col}

val df = dfA.crossJoin(broadcast(dfB))
  .withColumn("dot", dotProduct(col("featuresA"), col("featuresB")))
Upvotes: 1
Reputation: 1615
In Spark 3 there is now a dot method on a SparseVector object, which takes another vector as its argument.
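For example, a minimal sketch of using it inside a UDF (the dataframe df and the column names featuresA/featuresB are placeholders):

import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{col, udf}

// Spark 3+: delegate to the built-in dot method
val dot = udf { (v1: SparseVector, v2: SparseVector) => v1.dot(v2) }
val withDot = df.withColumn("dot", dot(col("featuresA"), col("featuresB")))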
If you want to do this in earlier versions, you can create a user-defined function that computes the dot product over the indices the two vectors have in common.
Here's my implementation of it:
import org.apache.spark.ml.linalg.SparseVector

def dotProduct(vec: SparseVector, vecOther: SparseVector): Double = {
  // multiply only over the indices both vectors actually have
  val commonIndices = vec.indices intersect vecOther.indices
  // sum is 0.0 when the vectors share no indices (a plain reduce would throw there)
  commonIndices.map(x => vec(x) * vecOther(x)).sum
}
I guess you know how to turn it into a Spark UDF from here and apply it to your dataframe's columns.
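In case it helps, a minimal sketch of that step (df and the column names colA/colB are placeholders):

import org.apache.spark.sql.functions.{col, udf}

val dotProductUdf = udf(dotProduct _)
val withDot = df.withColumn("dot", dotProductUdf(col("colA"), col("colB")))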
And if you L2-normalize your sparse vectors with org.apache.spark.ml.feature.Normalizer
before computing the dot product, you'll get cosine similarity in the end (by definition).
Upvotes: 1
Reputation: 4427
If the number of vectors you want to calculate the dot product with is small, cache the RDD[Vector] table. Then create a new table, cosine_vectors, by filtering the original table down to only the vectors you want cosine similarities for. Broadcast join those two together and calculate.
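A rough sketch of that approach at the DataFrame level, assuming an id column, a SparseVector features column, and a small collection of ids (wantedIds) you actually need similarities for; all names here are placeholders:

import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{broadcast, col, udf}

val dot = udf { (v1: SparseVector, v2: SparseVector) =>
  v1.indices.intersect(v2.indices).map(i => v1(i) * v2(i)).sum
}

// cache the full table, then keep only the vectors we need similarities for
dfVectors.cache()
val cosineVectors = dfVectors
  .filter(col("id").isin(wantedIds: _*))
  .withColumnRenamed("id", "idB")
  .withColumnRenamed("features", "featuresB")

// broadcast the small side so the join avoids shuffling the big table
val result = dfVectors
  .crossJoin(broadcast(cosineVectors))
  .withColumn("dot", dot(col("features"), col("featuresB")))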
Upvotes: 0