Prabin Shrestha

Reputation: 7

Euclidean distance between two vectors in two columns of a Spark DataFrame

I am trying to compute the Euclidean distance between two vectors stored in different columns of a Spark DataFrame. The computation needs to run in Spark. I have spent a lot of time on this but can't figure it out.

Below is example code similar to my case.

from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession, types as T, functions as F

# Initialize Spark session
spark = SparkSession.builder \
    .appName("euclidian distance") \
    .getOrCreate()

# Sample data
data = [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0])),
        (Vectors.dense([5.0, 6.0]), Vectors.dense([7.0, 8.0])),
        (Vectors.dense([9.0, 10.0]), Vectors.dense([11.0, 12.0]))]

# Create DataFrame
df = spark.createDataFrame(data, ["vector1", "vector2"])

# Define UDF for vector subtraction
def ed_vectors_udf(v1, v2):
  return Vectors.dense(v1).squared_distance(Vectors.dense(v2))
  # return Vectors.dense([x - y for x, y in zip(v1, v2)])

# Register UDF
# ed_vectors_udf = fn.udf(lambda v1, v2: eq_vectors(v1, v2), T.DoubleType())
spark.udf.register("ed_vectors_udf", ed_vectors_udf, T.DoubleType())

# Subtract vectors using UDF
df = df.withColumn("distance", ed_vectors_udf(fn.col('vector1'), fn.col('vector2')))

# Show DataFrame with subtraction result
df.show(5)

Link to a Google Colab test: https://colab.research.google.com/drive/1jQu2LFHV0mXK2mu3_sOWazMGbm49gvwi?usp=sharing

Upvotes: 0

Views: 142

Answers (1)

ARCrow

Reputation: 1857

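If the vectors are stored as array columns, you can do this with Spark SQL higher-order functions and no UDF: transform computes the squared difference for each pair of elements, aggregate sums them, and pow(..., 0.5) takes the square root: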

import pyspark.sql.functions as f

df = (
  spark.createDataFrame([
    ([1,2], [3, 4]),
    ([5,6], [7, 8]),
    ([9,10], [11, 12])
  ], ['vector1', 'vector2'])
  .withColumn('distance', f.expr('pow(aggregate(transform(vector1, (x, i) -> pow(vector2[i] - x, 2)), cast(0 as double), (acc, x) -> acc + x), 0.5)'))
)

And the output is:

+-------+--------+------------------+
|vector1| vector2|          distance|
+-------+--------+------------------+
| [1, 2]|  [3, 4]|2.8284271247461903|
| [5, 6]|  [7, 8]|2.8284271247461903|
|[9, 10]|[11, 12]|2.8284271247461903|
+-------+--------+------------------+
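
The example above uses array columns, while the question stores pyspark.ml.linalg vectors. For that case, a minimal sketch of a Python UDF might look like the following. It assumes the vector values support len() and integer indexing (DenseVector does), and the names euclidean_distance and with_distance are made up for illustration. Note that squared_distance in the question's attempt returns the squared distance, so a square root is still needed, and the function has to be wrapped with F.udf before it can be applied to columns.

```python
import math


def euclidean_distance(v1, v2):
    """Euclidean distance between two equal-length vectors.

    Works on anything that supports len() and integer indexing, which
    includes plain lists and pyspark.ml.linalg DenseVector values.
    """
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    squared = sum((v1[i] - v2[i]) ** 2 for i in range(len(v1)))
    # math.sqrt returns a plain Python float, which DoubleType expects
    return math.sqrt(squared)


def with_distance(df, col1="vector1", col2="vector2", out="distance"):
    """Hypothetical helper: append the distance between two vector columns."""
    # Imported here so euclidean_distance can be used without Spark installed
    from pyspark.sql import functions as F, types as T

    distance_udf = F.udf(euclidean_distance, T.DoubleType())
    return df.withColumn(out, distance_udf(F.col(col1), F.col(col2)))
```

With the DataFrame from the question, with_distance(df).show() should add a distance column; for the sample rows every distance is sqrt(8), roughly 2.83. Python UDFs are usually slower than built-in SQL expressions, so on Spark 3.0 or later it may be worth converting the vector columns with pyspark.ml.functions.vector_to_array and reusing the expression above instead.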

Upvotes: 1
