figs_and_nuts
figs_and_nuts

Reputation: 5771

what is the difference between udf from SparkSession and udf from pyspark.sql.functions

I have two ways to use udf in pyspark:

1.

spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark.udf)
output:
<pyspark.sql.udf.UDFRegistration at 0x7f5532f823a0>
from pyspark.sql.functions import udf
print(udf)
output:
<function pyspark.sql.functions.udf(f=None, returnType=StringType)>

I do not understand what is the intended difference between the two. I get doubts like why are two APIs available.spark.udf has a method called register available with it. I think registering a udf is necessary. Then, why is it not available in pyspark.sql.functions. Why is it needed only for the first case?

Can you please help me clarify these doubts?

Upvotes: 0

Views: 2048

Answers (1)

blackbishop
blackbishop

Reputation: 32710

spark.udf.register is used to register UDF to be invoked in Spark SQL query. While pyspark.sql.functions.udf is used to create UDF to be called when using DataFrame API.

Register UDF and use with SQL
from pyspark.sql.types import LongType

df = spark.range(1, 5)
df.createOrReplaceTempView("tb")

def plus_one(v):
    return v + 1

spark.udf.register("plus_one_udf", plus_one, LongType())

spark.sql("select id, plus_one_udf(id) as id2 from tb").show()
#+---+---+
#| id|id2|
#+---+---+
#|  1|  2|
#|  2|  3|
#|  3|  4|
#|  4|  5|
#+---+---+
Using with DataFrame API
import pyspark.sql.functions as F

plus_one_udf = F.udf(plus_one, LongType())

df.withColumn("id2", plus_one_udf(F.col("id"))).show()

#+---+---+
#| id|id2|
#+---+---+
#|  1|  2|
#|  2|  3|
#|  3|  4|
#|  4|  5|
#+---+---+

Upvotes: 1

Related Questions