Reputation: 3257
I have created a UDF; however, I need to call a function within the UDF. It currently returns nulls. Could someone please explain why I am getting these nulls?
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

def get_number(num):
    return range(num)

def cate(label):
    if label == 20:
        counting_list = get_number(4)
        return counting_list
    else:
        return [0]

udf_score = udf(cate, ArrayType(FloatType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)
out:
+------+---------+--------------------+
|Letter|distances| category_list|
+------+---------+--------------------+
| A| 20|[null, null, null...|
| B| 30| [null]|
| D| 80| [null]|
+------+---------+--------------------+
Upvotes: 1
Views: 3893
Reputation: 2718
The data type declared for your UDF is not correct: cate returns an array of integers, not floats, so Spark cannot convert the returned values to the declared element type and fills in nulls instead. Change:

udf_score = udf(cate, ArrayType(FloatType()))

to:

udf_score = udf(cate, ArrayType(IntegerType()))

(IntegerType is imported from pyspark.sql.types, like ArrayType and FloatType.) Hope this helps!
Edit: this assumes Python 2.x, where range returns a list. As @Shane Halloran mentions in the comments, range behaves differently in Python 3.x: it returns a lazy range object, which the UDF would need to materialize with list(range(num)).
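
Putting it together, here is a minimal sketch of the corrected snippet (assuming Python 3, hence the list(...) around range; the DataFrame and column names are taken from the question, and the SparkSession setup is added only to make it self-contained):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

def get_number(num):
    # On Python 3, range is lazy, so materialize it as a list before returning
    return list(range(num))

def cate(label):
    if label == 20:
        return get_number(4)
    else:
        return [0]

# The element type now matches the ints the UDF actually returns
udf_score = udf(cate, ArrayType(IntegerType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)

With the matching element type, the first row should show [0, 1, 2, 3] instead of nulls, and the other rows [0].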
Upvotes: 2