Reputation: 1857
I have two DataFrames, and I want to apply distance.euclidean(df1.select(col), df2.select(col))
for each column of the two DataFrames.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1,10),(2,13)],["A","B"])
df2 = spark.createDataFrame([(3,40),(2,20)],["A","B"])
# Apply the distance function to each column of `df1` and `df2`
from scipy.spatial import distance
for col in df1.columns:
    # Collect each column as a flat Python list on the driver, then compute the distance
    d = distance.euclidean(df1.select(col).rdd.flatMap(lambda x: x).collect(),
                           df2.select(col).rdd.flatMap(lambda x: x).collect())
    print(col, d)
The number of columns is large, about 5,000. Is there any way to calculate the distances for all columns in parallel instead of one by one in a for loop?
Upvotes: 0
Views: 670
Reputation: 19415
As far as I know there is no built-in Euclidean distance function, but you can easily build one with sum, pow, and sqrt, as the equation is pretty simple:
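For two column vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n),

d(x, y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)

which maps directly onto pow, sum, and sqrt. First join the two frames on a shared key: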
# Example frames with an explicit join key `id`
df1 = spark.createDataFrame([(1, 10, 1), (2, 13, 2), (3, 5, 3)], ["A", "B", "id"])
df2 = spark.createDataFrame([(3, 40, 1), (2, 20, 2), (3, 10, 3)], ["A", "B", "id"])

# Alias the frames so the duplicate column names stay addressable after the join
df1 = df1.alias("df1")
df2 = df2.alias("df2")
df = df1.join(df2, 'id', 'inner')
df.show()
Output:
+---+---+---+---+---+
| id| A| B| A| B|
+---+---+---+---+---+
| 1| 1| 10| 3| 40|
| 3| 3| 5| 3| 10|
| 2| 2| 13| 2| 20|
+---+---+---+---+---+
expression = ['sqrt(sum(pow((df1.{col} - df2.{col}),2))) as {col}'.format(col=c) for c in df1.columns if c !='id']
print(expression)
df.selectExpr(expression).show()
Output:
['sqrt(sum(pow((df1.A - df2.A),2))) as A', 'sqrt(sum(pow((df1.B - df2.B),2))) as B']
+---+-----------------+
| A| B|
+---+-----------------+
|2.0|31.20897306865447|
+---+-----------------+
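If you'd rather not build SQL strings, here is a minimal sketch of the same aggregation with the DataFrame API (assuming the aliased, joined df from above); Spark evaluates all the aggregates in a single pass over the joined data:

from pyspark.sql import functions as F

# One aggregate Column per data column; all are computed in one job
aggs = [F.sqrt(F.sum(F.pow(F.col("df1." + c) - F.col("df2." + c), 2))).alias(c)
        for c in df1.columns if c != "id"]
df.agg(*aggs).show()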
P.S.: collect should only be used when the DataFrame is small, as all of the data is loaded into the memory of your Spark driver.
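P.P.S.: the frames in the question have no id column to join on. One sketch of attaching a row index via zipWithIndex, assuming both frames yield rows in a consistent order (with_row_id is a hypothetical helper, not part of the code above):

from pyspark.sql import Row

def with_row_id(df):
    # Hypothetical helper: pair each row with its position,
    # then fold the index back in as an `id` column
    return df.rdd.zipWithIndex().map(
        lambda pair: Row(id=pair[1], **pair[0].asDict())
    ).toDF()

Then with_row_id(df1) and with_row_id(df2) can be joined on id as shown above.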
Upvotes: 1