Reputation: 11
I am using GCP/Dataproc for some Spark/GraphFrames calculations. On my private Spark/Hadoop standalone cluster, I have no issue using functools.partial when defining a PySpark UDF. But now, with GCP/Dataproc, I have the issue below.
Here are some basic settings to check whether partial works or not:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
def power(base, exponent):
    return base ** exponent
In the main function, functools.partial works as expected in ordinary cases:
# see whether partial works as it is
square = partial(power, exponent=2)
print "*** Partial test = ", square(2)
But if I pass this partial(power, exponent=2) function to a PySpark UDF as below,
testSquareUDF = F.udf(partial(power, exponent=2), T.FloatType())
testdf = inputdf.withColumn('pxsquare', testSquareUDF('px'))
I get this error message:
Traceback (most recent call last):
File "/tmp/bf297080f57a457dba4d3b347ed53ef0/gcloudtest-partial-error.py", line 120, in <module>
testSquareUDF = F.udf(square,T.FloatType())
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1971, in udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1955, in _udf
File "/opt/conda/lib/python2.7/functools.py", line 33, in update_wrapper
setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'functools.partial' object has no attribute '__module__'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bf297080f57a457dba4d3b347ed53ef0] entered state [ERROR] while waiting for [DONE].
=========
I didn't have this kind of issue with my standalone cluster. My Spark cluster's version is 2.1.1; GCP Dataproc's is 2.2.x.
Can anyone see what prevents me from passing the partial function to the UDF?
Upvotes: 0
Views: 349
Reputation: 793
As discussed in the comments, the issue was with Spark 2.2: as the traceback shows, its udf helper runs functools.update_wrapper on the passed function, which fails because a functools.partial object has no __module__ attribute. Spark 2.3 no longer trips over this, and since Spark 2.3 is also supported by Dataproc, just using --image-version=1.3 when creating the cluster fixes it.
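For example (the cluster name is just a placeholder here; any other flags depend on your setup):

gcloud dataproc clusters create my-cluster --image-version=1.3

If recreating the cluster isn't an option, a minimal workaround sketch for Spark 2.2 (reusing the power function from the question) is to wrap the partial call in a plain named function, since a regular def carries the __module__/__name__ attributes that functools.update_wrapper expects:

# a plain def has __module__ and __name__, so Spark 2.2's
# functools.wraps call inside F.udf succeeds
def square(base):
    return power(base, exponent=2)

testSquareUDF = F.udf(square, T.FloatType())
testdf = inputdf.withColumn('pxsquare', testSquareUDF('px'))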
Upvotes: 1