I. A

Reputation: 2312

Do User Defined Functions (UDFs) in Spark Run in Parallel on Cluster Worker Nodes?

Let's assume I have created a function in Python that raises a number to the power of 2:

def squared(s):
  return s * s

And then I registered the function in the Spark session as below:

spark.udf.register("squaredWithPython", squared)

Then, when I call the UDF in Spark SQL as in:

spark.range(1, 20).registerTempTable("test")
%sql select id, squaredWithPython(id) as id_squared from test

Is the function squaredWithPython then going to run on the worker nodes of the cluster, given that the data is distributed across the workers' memory? If yes, then what are vectorized UDFs used for? And what is the difference between a UDF and a vectorized UDF?

Likewise, the same question applies to using UDFs with the DataFrame API, as in the sketch below.
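For example, with the DataFrame API I would expect something like the following sketch (adapted from the same docs page; the udf wrapper and the LongType return type are my assumptions):

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Wrap the same Python function for use with the DataFrame API
# (return type assumed to be LongType, since the ids are integers)
squared_udf = udf(squared, LongType())

df = spark.range(1, 20)
df.select("id", squared_udf("id").alias("id_squared")).show()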

Please note that the code is retrieved from: https://docs.databricks.com/spark/latest/spark-sql/udf-python.html

Any help is much appreciated!!

Upvotes: 2

Views: 3694

Answers (1)

I. A

Reputation: 2312

The difference between a UDF and a pandas UDF is: a plain UDF applies the Python function one row at a time to the DataFrame or SQL table, and each row is serialized (converted into a Python object) before the function is applied. A pandas UDF, on the other hand, converts batches of the Spark DataFrame into pandas Series/DataFrames using Apache Arrow (much cheaper than row-by-row serialization) and then applies the Python function to the pandas Series/DataFrame. The function is vectorized because its input is a pandas Series/DataFrame and not one row at a time. In both cases the function runs in Python processes on the worker nodes, alongside the executors that hold the data.
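As an illustration, here is a minimal sketch of a vectorized version of the squared function, assuming Spark 3.x with PyArrow available (the name squared_pandas is just for this example):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Vectorized UDF: receives a whole pandas Series per Arrow batch,
# not one row at a time (Spark 3.x type-hint syntax assumed)
@pandas_udf("long")
def squared_pandas(s: pd.Series) -> pd.Series:
    return s * s

spark.range(1, 20).select("id", squared_pandas("id").alias("id_squared")).show()

The row-at-a-time UDF above would give the same result, but it pays the serialization cost once per row instead of once per Arrow batch.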

Upvotes: 2
