Reputation: 21
The conda installation currently provides pyspark 2.4.0, while pip allows installing a later version, 3.1.2. With 3.1.2, however, the dill library conflicts with the pickle library.
I use pyspark for unit tests. If I import dill in the test script, or if any other test that imports dill runs alongside the pyspark test under pytest, it breaks.
It fails with the error below:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
    *self._dynamic_function_reduce(obj), obj=obj
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
    dictitems=dictitems, obj=obj
  File "/opt/conda/lib/python3.6/pickle.py", line 610, in save_reduce
    save(args)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/dill/_dill.py", line 1146, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty
This happens in the save function of /opt/conda/lib/python3.6/pickle.py. After the persistent-id and memo checks, save gets the type of obj, and if that type is the cell class, it looks up a handler for it on the next line via self.dispatch.get. With pyspark 2.4.0 that lookup returns None and everything works well, but with pyspark 3.1.2 it returns an object, which forces the cell through save_reduce. That fails because the cell is empty. Eg: <cell at 0x7f0729a2as66: empty>
If we force the return value to be None for the pyspark 3.1.2 installation, it works, but that needs to happen gracefully rather than by hardcoding.
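To see the final failure in isolation, here is a minimal sketch (make_empty_cell is just an illustrative helper, not pyspark code) that reproduces the ValueError raised at line 1146 of dill's _dill.py, which performs exactly f = obj.cell_contents:

def make_empty_cell():
    if False:
        x = None  # dead branch: x becomes a local here, but is never assigned
    def inner():
        return x  # referencing x makes inner capture it in a closure cell
    return inner.__closure__[0]

cell = make_empty_cell()
print(cell)  # e.g. <cell at 0x...: empty>

# Same operation as dill's save_cell, hence the same error:
cell.cell_contents  # raises ValueError: Cell is empty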
Has anyone had this issue? Any suggestions on which versions of dill, pickle, and pyspark to use together?
Here is the code being used:
import pytest
from pyspark.sql import SparkSession
import dill  # if this line is added, the test does not work with pyspark-3.1.2

simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
]
schema = ["A", "B", "C", "D", "E", "F"]

@pytest.fixture(scope="session")
def start_session(request):
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("Python Spark unit test")
        .getOrCreate()
    )
    yield spark
    spark.stop()

def test_simple_rdd(start_session):
    rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
    assert rdd.stdev() == 2.0
This works with pyspark 2.4.0 but fails with pyspark 3.1.2, giving the error above.
dill version - 0.3.1.1, pickle format version - 4.0, python - 3.6
Upvotes: 2
Views: 1287
Reputation: 35247
Apparently you aren't using dill except to import it. I assume you will be using it later...? As I mentioned in my comment, cloudpickle and dill do have some mild conflicts, and this appears to be what you are experiencing. Both serializers add logic to the pickle registry to tell python how to serialize different kinds of objects. So, if you use both dill and cloudpickle, there can be conflicts, as the pickle registry is a dict -- so the order of imports and so on matters.
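As a rough illustration of the shared-registry point (this reads pickle._Pickler.dispatch, the pure-Python pickler's handler table that appears in the traceback above; exact internals can shift across dill and Python versions, so treat this as a sketch):

import pickle

# The pure-Python pickler keeps its per-type save handlers in a plain dict.
registered_before = set(pickle._Pickler.dispatch)

import dill  # on import, dill adds its own handlers to that registry

registered_after = set(pickle._Pickler.dispatch)
# The difference includes types such as closure cells, which is how
# pyspark's cloudpickle can end up calling dill's save_cell.
print(registered_after - registered_before)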
The issue is similar to the one noted here:
https://github.com/tensorflow/tfx/issues/2090
There are a few things you can try:
(1) Some codebases allow you to replace the serializer. So, if you are able to replace cloudpickle with dill, that may resolve the conflicts. I'm not sure this can be done with pyspark, but there is a pyspark module on serializers, so that is promising (see the sketch below):
Set PySpark Serializer in PySpark Builder
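As a sketch of where that knob lives rather than a confirmed fix: pyspark.SparkContext accepts a serializer argument (e.g. MarshalSerializer from pyspark.serializers). Note that this controls data serialization and may not replace cloudpickle for function/closure pickling, which is where the error above occurs:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# Swap the data serializer; whether this sidesteps cloudpickle for
# closures is the open question.
sc = SparkContext(
    master="local[1]",
    appName="serializer-swap",
    serializer=MarshalSerializer(),
)
print(sc.parallelize([1, 2, 3, 4, 5, 6, 7]).stdev())
sc.stop()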
(2) dill provides a mechanism to help mitigate some of the conflicts in the pickle registry. If you use dill.extend(False) before using cloudpickle, then dill.extend(True) before using dill, it may clear up the issue you are seeing.
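For example, in the test above (reusing the asker's start_session fixture), the toggling might look like this:

import dill

def test_simple_rdd(start_session):
    dill.extend(False)  # remove dill's handlers from the pickle registry
    try:
        rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
        assert rdd.stdev() == 2.0
    finally:
        dill.extend(True)  # restore them for any code that needs dill itself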
Upvotes: 2