Reputation: 776
I'm trying to add a column to my Spark DataFrame using withColumn
and a udf that takes no arguments. This only seems to work if I use a lambda to encapsulate my original function.
Here's an MWE:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(number=i) for i in range(10)])
def foo():
    return 'bar'
udfoo = udf(foo())
df = df.withColumn('word', udfoo())
# Fails with TypeError: _create_udf() missing 1 required positional argument: 'f'
udfoo = udf(lambda: foo())
df = df.withColumn('word', udfoo())
# Works
I've managed to achieve the behaviour I want, so a "solution" is not exactly what I'm looking for (even though I welcome any suggestions for a better/more idiomatic way to implement this kind of thing). If anyone lands here looking for a "how to do it" answer, this other question might help.
What I'm really after is an explanation: why does the first approach fail and the second work?
I'm using Spark 2.4.0 and Python 3.7.3 on Ubuntu 18.04.2.
Upvotes: 2
Views: 2031
Reputation: 590
udf expects a function to be passed to it, but when you call foo() it evaluates immediately to a string. Since 'bar' is a string rather than a function, udf treats it as the returnType (the decorator-style usage) and returns a wrapper that still expects a function, which is why the TypeError only surfaces later, when you call udfoo(). You'll see the behavior you're expecting if you use udf(foo) instead of udf(foo()).
i.e.
udfoo = udf(foo)
df = df.withColumn('word', udfoo())
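For completeness, here's a minimal end-to-end sketch of that fix, built on the same DataFrame as in the question (the explicit StringType() is optional since it's udf's default, shown only for clarity):
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(number=i) for i in range(10)])
def foo():
    return 'bar'
# foo is a callable, while foo() is just the string 'bar'
print(callable(foo), callable(foo()))  # True False
udfoo = udf(foo, StringType())  # pass the function object itself, not its result
df.withColumn('word', udfoo()).show(3)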
In case it helps, if you are trying to get a column that is just a constant value, you can use pyspark.sql.functions.lit, like:
from pyspark.sql import functions as F
df.withColumn('word', F.lit('bar'))
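Since lit builds the literal on the JVM side, it also skips the Python UDF round-trip entirely, so for a genuinely constant column it's generally the faster and more idiomatic option.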
Upvotes: 5