Chao

Reputation: 905

How to calculate the cosine similarity of two vectors in PySpark?

I want to compute the cosine similarity of two vectors in PySpark, like

1 - spatial.distance.cosine(xvec, yvec)

but scipy does not seem to support the pyspark.ml.linalg.Vector type.

Upvotes: 3

Views: 9111

Answers (1)

akuiper

Reputation: 214957

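You can use the dot and norm methods to calculate this pretty easily. Note that the expression below is the cosine distance (1 minus the similarity), which is what scipy's cosine returns; drop the leading "1 -" to get the cosine similarity itself: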

from pyspark.ml.linalg import Vectors
x = Vectors.dense([1,2,3])
y = Vectors.dense([2,3,5])

1 - x.dot(y)/(x.norm(2)*y.norm(2))
# 0.0028235350472619603

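With scipy, on plain NumPy arrays (scipy's cosine returns the cosine distance, so the result is the same):

import numpy as np
from scipy.spatial.distance import cosine

x = np.array([1,2,3])
y = np.array([2,3,5])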

cosine(x, y)
# 0.0028235350472619603
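
Since the question asks for the similarity rather than the distance, here is a minimal sketch of a helper that returns it directly. The name cosine_similarity is hypothetical, not part of PySpark or scipy. It relies on pyspark.ml.linalg vectors exposing toArray(), and falls back to treating the input as a plain array-like, so it can be tried without a Spark session. Raising on a zero vector is an assumption of this sketch; returning NaN would be an equally reasonable choice.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine similarity of two vectors (hypothetical helper, not a PySpark API)."""
    # pyspark.ml.linalg vectors expose toArray(); anything else is treated as array-like.
    a = np.asarray(x.toArray() if hasattr(x, "toArray") else x, dtype=float)
    b = np.asarray(y.toArray() if hasattr(y, "toArray") else y, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumption of this sketch: a zero vector is an error rather than NaN.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


# Same vectors as above; the result is about 0.9972, i.e. 1 minus the distance.
print(cosine_similarity([1, 2, 3], [2, 3, 5]))
```

Passing Vectors.dense([1,2,3]) and Vectors.dense([2,3,5]) instead of the lists should give the same value, since both are converted to NumPy arrays first.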

Upvotes: 10
