Reputation: 67
The transactions_df is the DF I am running my UDF on, and inside the UDF I am referencing another DF (currency_exchange_df) to get values from based on some conditions.
def convertRate(row):
    completed = row["completedAt"]
    currency = row["currency"]
    amount = row["amount"]
    if currency == "MXN":
        rate = currency_exchange_df.select("rate").where((transactions_df.to == "MXN") & (completed >= col("effectiveAt")) & (completed < col("effectiveTill")))
        amount = amount / rate
    final_rate = currency_exchange_df.select("rate").where((transactions_df.to == "CAD") & (completed >= col("effectiveAt")) & (completed < col("effectiveTill")))
    converted = amount * final_rate
    return converted

convertUDF = f.udf(lambda row: convertRate(row), DoubleType())
To call the UDF, I am passing the row as a struct. I got this solution from here.
temp = transactions_df.withColumn("newAmount", convertUDF(f.struct([transactions_df[x] for x in transactions_df.columns])))
temp.show()
I am getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\serializers.py:437, in CloudPickleSerializer.dumps(self, obj)
436 try:
--> 437 return cloudpickle.dumps(obj, pickle_protocol)
438 except pickle.PickleError:
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:72, in dumps(obj, protocol, buffer_callback)
69 cp = CloudPickler(
70 file, protocol=protocol, buffer_callback=buffer_callback
71 )
---> 72 cp.dump(obj)
73 return file.getvalue()
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\cloudpickle\cloudpickle_fast.py:540, in CloudPickler.dump(self, obj)
539 try:
--> 540 return Pickler.dump(self, obj)
541 except RuntimeError as e:
TypeError: cannot pickle '_thread.RLock' object
During handling of the above exception, another exception occurred:
PicklingError Traceback (most recent call last)
Input In [40], in <cell line: 1>()
----> 1 temp = transactions_df.withColumn("newAmount", convertUDF(f.struct([transactions_df[x] for x in transactions_df.columns])))
2 temp.show()
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\udf.py:199, in UserDefinedFunction._wrapped.<locals>.wrapper(*args)
197 @functools.wraps(self.func, assigned=assignments)
198 def wrapper(*args):
--> 199 return self(*args)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\udf.py:177, in UserDefinedFunction.__call__(self, *cols)
176 def __call__(self, *cols):
--> 177 judf = self._judf
178 sc = SparkContext._active_spark_context
179 return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\udf.py:161, in UserDefinedFunction._judf(self)
154 @property
155 def _judf(self):
156 # It is possible that concurrent access, to newly created UDF,
157 # will initialize multiple UserDefinedPythonFunctions.
158 # This is unlikely, doesn't affect correctness,
159 # and should have a minimal performance impact.
160 if self._judf_placeholder is None:
--> 161 self._judf_placeholder = self._create_judf()
162 return self._judf_placeholder
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\udf.py:170, in UserDefinedFunction._create_judf(self)
167 spark = SparkSession.builder.getOrCreate()
168 sc = spark.sparkContext
--> 170 wrapped_func = _wrap_function(sc, self.func, self.returnType)
171 jdt = spark._jsparkSession.parseDataType(self.returnType.json())
172 judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
173 self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\udf.py:34, in _wrap_function(sc, func, returnType)
32 def _wrap_function(sc, func, returnType):
33 command = (func, returnType)
---> 34 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
35 return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
36 sc.pythonVer, broadcast_vars, sc._javaAccumulator)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\rdd.py:2814, in _prepare_for_python_RDD(sc, command)
2811 def _prepare_for_python_RDD(sc, command):
2812 # the serialized command will be compressed by broadcast
2813 ser = CloudPickleSerializer()
-> 2814 pickled_command = ser.dumps(command)
2815 if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc): # Default 1M
2816 # The broadcast will have same life cycle as created PythonRDD
2817 broadcast = sc.broadcast(pickled_command)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\serializers.py:447, in CloudPickleSerializer.dumps(self, obj)
445 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
446 print_exec(sys.stderr)
--> 447 raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Sample DFs are as follows. The first DF is my transactions_df; the second DF contains the exchange rates.
The transactions provided are in either US dollars or Mexican pesos, and the currency exchange data contains only an "effectiveAt" date; it is assumed that an exchange rate remains the same until a new record for that rate is provided. I have to convert all the transactions into CAD. Note that we must first convert MXN to USD, then USD to CAD.
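For example, taking the rates from the sample exchange data (USD to MXN 20.79 and USD to CAD 1.33 in effect on that date), the 621.42 MXN transaction completed on 2021-10-06 should be converted like this (a plain-Python sketch of the intended logic, for illustration only):
# Two-step conversion for the 2021-10-06 MXN transaction
amount_mxn = 621.42
usd_to_mxn = 20.79   # USD -> MXN rate in effect on 2021-10-06
usd_to_cad = 1.33    # USD -> CAD rate in effect on 2021-10-06

amount_usd = amount_mxn / usd_to_mxn            # step 1: MXN -> USD  (~29.89)
amount_cad = round(amount_usd * usd_to_cad, 2)  # step 2: USD -> CAD  (39.75)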
The third DF is the expected solution.
Upvotes: 2
Views: 3392
Reputation: 24438
Your initial script is only supposed to do the conversion, but the screenshot shows that other transformations were performed too: grouping, aggregation, and pivoting. All of this may seem like a lot for one question, but I did it to show that it is genuinely possible using just dataframes, natively. The code may look bigger, but this way it is more manageable and efficient.
Inputs:
from pyspark.sql import functions as F, Window as W
transactions_df = spark.createDataFrame(
[(7, '2021-10-01', 'USD', 30.0, 'DEBIT', '2021-10-01'),
(9, '2021-10-02', 'USD', 10.0, 'DEBIT', '2021-10-02'),
(6, '2021-10-03', 'USD', 29.99, 'CREDIT', '2021-10-03'),
(2, '2021-10-03', 'USD', 29.99, 'CREDIT', '2021-10-03'),
(1, '2021-10-04', 'USD', 29.99, 'CREDIT', '2021-10-04'),
(4, '2021-10-04', 'USD', 49.99, 'CREDIT', '2021-10-04'),
(8, '2021-10-05', 'USD', 9.99, 'DEBIT', '2021-10-05'),
(3, '2021-10-06', 'MXN', 621.42, 'CREDIT', '2021-10-06'),
(5, '2021-10-07', 'USD', 35.99, 'CREDIT', '2021-10-07')],
['id', 'completedAt', 'currency', 'amount', 'type', 'completedDate'])
currency_exchange_df = spark.createDataFrame(
[(3, 'USD', 'MXN', 20.44, '2021-09-28', '2021-10-05'),
(4, 'USD', 'CAD', 1.35, '2021-09-28', '2021-10-04'),
(2, 'USD', 'CAD', 1.33, '2021-10-04', '9999-12-31'),
(1, 'USD', 'MXN', 20.79, '2021-10-05', '9999-12-31')],
['id', 'from', 'to', 'rate', 'effectiveAt', 'effectiveTill'])
First, for currency_exchange_df, create a row for every possible date in your ranges. This is an inexpensive operation, because it produces at most 365 rows per year per currency pair.
eff_at = F.to_date('effectiveAt')
# Latest transaction date, used as the upper bound for the last (open-ended) rate
max_date = F.lit(transactions_df.agg(F.max(F.to_date('completedAt'))).head()[0])

# For each currency pair, a rate stays effective until the day before the next rate starts
w = W.partitionBy('from', 'to').orderBy(eff_at)
currency_exchange_df = (currency_exchange_df
    .withColumn('effective', F.sequence(eff_at, F.coalesce(F.date_sub(F.lead(eff_at).over(w), 1), max_date)))
    .withColumn('effective', F.explode('effective'))
)
# +---+----+---+-----+-----------+-------------+----------+
# | id|from| to| rate|effectiveAt|effectiveTill| effective|
# +---+----+---+-----+-----------+-------------+----------+
# | 4| USD|CAD| 1.35| 2021-09-28| 2021-10-04|2021-09-28|
# | 4| USD|CAD| 1.35| 2021-09-28| 2021-10-04|2021-09-29|
# | 4| USD|CAD| 1.35| 2021-09-28| 2021-10-04|2021-09-30|
# | 4| USD|CAD| 1.35| 2021-09-28| 2021-10-04|2021-10-01|
# | 4| USD|CAD| 1.35| 2021-09-28| 2021-10-04|2021-10-02|
# | 4| USD|CAD| 1.35| 2021-09-28| 2021-10-04|2021-10-03|
# | 2| USD|CAD| 1.33| 2021-10-04| 9999-12-31|2021-10-04|
# | 2| USD|CAD| 1.33| 2021-10-04| 9999-12-31|2021-10-05|
# | 2| USD|CAD| 1.33| 2021-10-04| 9999-12-31|2021-10-06|
# | 2| USD|CAD| 1.33| 2021-10-04| 9999-12-31|2021-10-07|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-09-28|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-09-29|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-09-30|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-10-01|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-10-02|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-10-03|
# | 3| USD|MXN|20.44| 2021-09-28| 2021-10-05|2021-10-04|
# | 1| USD|MXN|20.79| 2021-10-05| 9999-12-31|2021-10-05|
# | 1| USD|MXN|20.79| 2021-10-05| 9999-12-31|2021-10-06|
# | 1| USD|MXN|20.79| 2021-10-05| 9999-12-31|2021-10-07|
# +---+----+---+-----+-----------+-------------+----------+
Then, create df_rates, containing exchange rates that can be used to convert directly to CAD. For this, a self-join is used.
# Join every rate row to the USD->CAD rate for the same date;
# USD->CAD rows themselves are excluded so the left join leaves them unmatched
join_on = (F.col('a.effective') == F.col('b.effective')) & (F.col('a.to') != 'CAD')
df_rates = (currency_exchange_df.alias('a')
    .join(currency_exchange_df.filter("from = 'USD' and to = 'CAD'").alias('b'), join_on, 'left')
    .select(
        F.col('a.effective').alias('completedAt'),
        # USD->CAD rows describe the rate for USD amounts; other rows keep their 'to' currency
        F.when(F.col('a.to') == 'CAD', 'USD').otherwise(F.col('a.to')).alias('currency'),
        # X->CAD rate = (USD->CAD) / (USD->X); unmatched USD->CAD rows fall back to their own rate
        F.coalesce(F.col('b.rate') / F.col('a.rate'), 'a.rate').alias('rate')
    )
)
# +-----------+--------+-------------------+
# |completedAt|currency| rate|
# +-----------+--------+-------------------+
# | 2021-10-02| USD| 1.35|
# | 2021-10-02| MXN|0.06604696673189824|
# | 2021-09-30| USD| 1.35|
# | 2021-09-30| MXN|0.06604696673189824|
# | 2021-10-05| USD| 1.33|
# | 2021-10-05| MXN|0.06397306397306397|
# | 2021-09-28| USD| 1.35|
# | 2021-09-28| MXN|0.06604696673189824|
# | 2021-09-29| USD| 1.35|
# | 2021-09-29| MXN|0.06604696673189824|
# | 2021-10-03| USD| 1.35|
# | 2021-10-03| MXN|0.06604696673189824|
# | 2021-10-06| USD| 1.33|
# | 2021-10-06| MXN|0.06397306397306397|
# | 2021-10-04| USD| 1.33|
# | 2021-10-04| MXN|0.06506849315068493|
# | 2021-10-01| USD| 1.35|
# | 2021-10-01| MXN|0.06604696673189824|
# | 2021-10-07| USD| 1.33|
# | 2021-10-07| MXN|0.06397306397306397|
# +-----------+--------+-------------------+
Finally, join with transactions_df, then group, pivot, and aggregate.
df_converted = (transactions_df
.join(df_rates, ['completedAt', 'currency'], 'left')
.withColumn('types', F.concat(F.initcap('type'), F.lit('s')))
.groupBy(F.col('completedAt').alias('date'))
.pivot('types', ['Credits', 'Debits'])
.agg(F.round(F.sum(F.col('amount') * F.col('rate')), 2))
.fillna(0)
)
df_converted.sort('date').show()
# +----------+-------+------+
# | date|Credits|Debits|
# +----------+-------+------+
# |2021-10-01| 0.0| 40.5|
# |2021-10-02| 0.0| 13.5|
# |2021-10-03| 80.97| 0.0|
# |2021-10-04| 106.37| 0.0|
# |2021-10-05| 0.0| 13.29|
# |2021-10-06| 39.75| 0.0|
# |2021-10-07| 47.87| 0.0|
# +----------+-------+------+
The DataFrame API in Spark is becoming more and more powerful; this script didn't even require higher-order functions with predicates. If you cannot do the task using native Spark functionality, you can turn to pandas_udf. Reaching for a regular udf is a relic of the past: they are inefficient, and in 99% of cases they are entirely avoidable.
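For illustration only, a minimal pandas_udf sketch (assuming the rate has already been joined on, e.g. reusing the df_rates table built above; the to_cad name and the amountCad column are made up for this sketch) could look like this:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Series-to-series pandas UDF: receives whole Arrow batches, not one row at a time
@F.pandas_udf(DoubleType())
def to_cad(amount: pd.Series, rate: pd.Series) -> pd.Series:
    return amount * rate

# Hypothetical usage, after joining the per-date rates onto the transactions
converted = (transactions_df
    .join(df_rates, ['completedAt', 'currency'], 'left')
    .withColumn('amountCad', F.round(to_cad('amount', 'rate'), 2))
)
Even here the pandas_udf is not strictly needed, since the same thing can be expressed natively as F.round(F.col('amount') * F.col('rate'), 2), which is exactly what the solution above does.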
Upvotes: 1