Reputation: 11
I am using GCP/Dataproc for some Spark/GraphFrames calculations. On my private Spark/Hadoop standalone cluster, I have no issue using functools.partial when defining a PySpark UDF. But now, with GCP/Dataproc, I have the issue below.
Here are some basic settings to check whether partial works or not:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
def power(base, exponent):
    return base ** exponent
In the main function, functools.partial works as expected in ordinary cases:
# see whether partial works as it is
square = partial(power, exponent=2)
print "*** Partial test = ", square(2)
But if I pass this partial(power, exponent=2) function to a PySpark UDF as below,
testSquareUDF = F.udf(partial(power, exponent=2), T.FloatType())
testdf = inputdf.withColumn('pxsquare', testSquareUDF('px'))
I get this error message:
Traceback (most recent call last):
File "/tmp/bf297080f57a457dba4d3b347ed53ef0/gcloudtest-partial-error.py", line 120, in <module>
testSquareUDF = F.udf(square,T.FloatType())
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1971, in udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1955, in _udf
File "/opt/conda/lib/python2.7/functools.py", line 33, in update_wrapper
setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'functools.partial' object has no attribute '__module__'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bf297080f57a457dba4d3b347ed53ef0] entered state [ERROR] while waiting for [DONE].
=========
I didn't have this kind of issue with my standalone cluster. My Spark cluster's version is 2.1.1; GCP Dataproc's is 2.2.x.
Can anyone see what prevents me from passing the partial function to the UDF?
Upvotes: 0
Views: 349
Reputation: 793
As discussed in the comments, the issue was with Spark 2.2: as the traceback shows, its udf helper runs functools.update_wrapper on the passed function, which fails because a functools.partial object has no __module__ attribute. Spark 2.3 no longer trips over this, and since Spark 2.3 is also supported by Dataproc, just using --image-version=1.3 when creating the cluster fixes it.
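For example (the cluster name is just a placeholder here; any other flags depend on your setup):

gcloud dataproc clusters create my-cluster --image-version=1.3

If recreating the cluster isn't an option, a minimal workaround sketch for Spark 2.2 (reusing the power function from the question) is to wrap the partial call in a plain named function, since a regular def carries the __module__/__name__ attributes that functools.update_wrapper expects:

# a plain def has __module__ and __name__, so Spark 2.2's
# functools.wraps call inside F.udf succeeds
def square(base):
    return power(base, exponent=2)

testSquareUDF = F.udf(square, T.FloatType())
testdf = inputdf.withColumn('pxsquare', testSquareUDF('px'))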
Upvotes: 1