Bryce Ramgovind

Reputation: 3257

PySpark - Calling a function within a UDF

I have created a UDF, but I need to call another function from inside it. The resulting column currently contains only nulls. Could someone please explain why this happens?

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

# `spark` is an existing SparkSession
a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

# Helper that the UDF calls
def get_number(num):
    return range(num)

def cate(label):
    if label == 20:
        counting_list = get_number(4)
        return counting_list
    else:
        return [0]

udf_score = udf(cate, ArrayType(FloatType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)

out:

+------+---------+--------------------+
|Letter|distances|       category_list|
+------+---------+--------------------+
|     A|       20|[null, null, null...|
|     B|       30|              [null]|
|     D|       80|              [null]|
+------+---------+--------------------+

Upvotes: 1

Views: 3893

Answers (1)

mkaran

Reputation: 2718

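The return type declared for your UDF is not correct: cate returns a list of integers, not floats. Spark does not cast a UDF's return values to the declared type, so the mismatched elements come back as null. Please change: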

udf_score=udf(cate, ArrayType(FloatType()))

to:

udf_score=udf(cate, ArrayType(IntegerType()))
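
For reference, here is a minimal sketch of how the whole thing might look with that change applied. It assumes an active SparkSession named spark, as in the question; add_category_list is a hypothetical helper name used only for this illustration, and get_number wraps range in list(...) so that it returns a real list on Python 3 too.

```python
def get_number(num):
    # list(...) so the result is a real list on Python 3 as well as Python 2
    return list(range(num))


def cate(label):
    # Both branches return a list of Python ints
    if label == 20:
        return get_number(4)
    return [0]


def add_category_list(df):
    # Hypothetical helper, not part of the question's code.
    # pyspark is imported here so the plain functions above can be
    # exercised without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    # The declared element type now matches what cate returns (ints)
    udf_score = udf(cate, ArrayType(IntegerType()))
    return df.withColumn("category_list", udf_score(df["distances"]))


# Usage, assuming `spark` is an active SparkSession as in the question:
# a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
# add_category_list(a).show(10)
```

If you would rather keep ArrayType(FloatType()), the other option is to make cate return floats, for example by converting each element with float(...).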

Hope this helps!

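Edit: this assumes Python 2.x, where range returns a list. As @Shane Halloran mentions in the comments, range behaves differently in Python 3.x: it returns a range object rather than a list, so wrapping the call as list(range(num)) is the safer choice there.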
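
To illustrate the range point, a small plain-Python sketch (no Spark needed) of a version of get_number that should behave the same on both Python versions:

```python
def get_number(num):
    # Python 2: range(num) is already a list, so list(...) just copies it.
    # Python 3: range(num) is a lazy range object; list(...) materializes it.
    return list(range(num))


values = get_number(4)
print(type(values).__name__, values)  # list [0, 1, 2, 3]
```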

Upvotes: 2
