Reputation: 2312
Let's assume I have created a function in Python that raises a number to the power of 2:
def squared(s):
    return s * s
I then registered the function in the Spark session as below:
spark.udf.register("squaredWithPython", squared)
Then I call the UDF in Spark SQL, as in:
spark.range(1, 20).registerTempTable("test")
%sql select id, squaredWithPython(id) as id_squared from test
Is the function squaredWithPython then going to run on the worker nodes of the cluster, if the data is distributed across the workers' memory? If yes, then what are vectorized UDFs used for? And what is the difference between a UDF and a vectorized UDF?
The same question applies to using a UDF with DataFrames.
Please note that the code is retrieved from: https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
Any help is much appreciated!!
Upvotes: 2
Views: 3694
Reputation: 2312
The difference between a UDF and a Pandas UDF (pandas_udf, also called a vectorized UDF) is this: a regular UDF applies the Python function one row at a time to the DataFrame or SQL table. In addition, every row is serialized and converted into Python objects before the Python function is applied. A Pandas UDF, on the other hand, receives its input as pandas Series (or DataFrames): Spark transfers the columns to Python in batches using Apache Arrow, which is much cheaper than row-by-row serialization, and then applies the Python function to each batch. The function is vectorized because its input is a pandas DataFrame/Series and not one row at a time.
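To make the comparison concrete, here is a minimal sketch that puts the two styles side by side. It assumes pandas is installed, and that pyspark and pyarrow are available when the Spark part is run. The names squared_vectorized and run_on_spark are hypothetical helpers introduced only for this illustration; they are not part of the question or of the Databricks page.

```python
import pandas as pd


def squared(s):
    # Row-at-a-time UDF body: Spark calls this once per row,
    # passing a single Python value.
    return s * s


def squared_vectorized(s: pd.Series) -> pd.Series:
    # Pandas UDF body: Spark calls this once per Arrow batch,
    # passing a whole pandas Series, so the multiply is vectorized.
    return s * s


def run_on_spark():
    # Hypothetical helper (an assumption for this sketch): wraps both
    # functions and applies them through the DataFrame API.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf, udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1, 20)

    row_udf = udf(squared, LongType())
    vec_udf = pandas_udf(squared_vectorized, returnType=LongType())

    return df.select(
        "id",
        row_udf(col("id")).alias("id_squared_row"),
        vec_udf(col("id")).alias("id_squared_vectorized"),
    )
```

Calling run_on_spark().show() should display the same squared values in both columns. The two function bodies are identical here; what differs is how Spark feeds them data, one value per call versus one pandas Series per batch.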
Upvotes: 2