Reputation: 21
The conda installation currently provides pyspark 2.4.0, while pip allows installing a later version, 3.1.2. With 3.1.2, however, the dill library conflicts with the pickle library.
I use pyspark for unit tests. If I import dill in the test script, or if any other test that imports dill runs alongside the pyspark test under pytest, it breaks.
It fails with the error below:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
    *self._dynamic_function_reduce(obj), obj=obj
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
    dictitems=dictitems, obj=obj
  File "/opt/conda/lib/python3.6/pickle.py", line 610, in save_reduce
    save(args)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/dill/_dill.py", line 1146, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty
This happens in the save function of /opt/conda/lib/python3.6/pickle.py. After the persistent-id and memo checks, save gets the type of obj, and if that type is the cell class, it looks up a handler for it on the next line via self.dispatch.get. With pyspark 2.4.0 that lookup returns None and everything works well, but with pyspark 3.1.2 it returns an object, which forces the cell through save_reduce. That fails because the cell is empty. Eg: <cell at 0x7f0729a2as66: empty>
If we force the return value to be None for the pyspark 3.1.2 installation, it works, but that needs to happen gracefully rather than by hardcoding.
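To see the final failure in isolation, here is a minimal sketch (make_empty_cell is just an illustrative helper, not pyspark code) that reproduces the ValueError raised at line 1146 of dill's _dill.py, which performs exactly f = obj.cell_contents:

def make_empty_cell():
    if False:
        x = None  # dead branch: x becomes a local here, but is never assigned
    def inner():
        return x  # referencing x makes inner capture it in a closure cell
    return inner.__closure__[0]

cell = make_empty_cell()
print(cell)  # e.g. <cell at 0x...: empty>

# Same operation as dill's save_cell, hence the same error:
cell.cell_contents  # raises ValueError: Cell is empty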
Has anyone had this issue? Any suggestions on which versions of dill, pickle, and pyspark to use together?
Here is the code being used:
import pytest
from pyspark.sql import SparkSession
import dill  # if this line is added, the test does not work with pyspark-3.1.2

simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
]
schema = ["A", "B", "C", "D", "E", "F"]

@pytest.fixture(scope="session")
def start_session(request):
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("Python Spark unit test")
        .getOrCreate()
    )
    yield spark
    spark.stop()

def test_simple_rdd(start_session):
    rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
    assert rdd.stdev() == 2.0
This works with pyspark 2.4.0 but fails with pyspark 3.1.2, giving the error above.
dill version - 0.3.1.1, pickle format version - 4.0, python - 3.6
Upvotes: 2
Views: 1287
Reputation: 35247
Apparently you aren't using dill except to import it. I assume you will be using it later...? As I mentioned in my comment, cloudpickle and dill do have some mild conflicts, and this appears to be what you are experiencing. Both serializers add logic to the pickle registry to tell python how to serialize different kinds of objects. So, if you use both dill and cloudpickle, there can be conflicts, as the pickle registry is a dict -- so the order of imports and so on matters.
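As a rough illustration of the shared-registry point (this reads pickle._Pickler.dispatch, the pure-Python pickler's handler table that appears in the traceback above; exact internals can shift across dill and Python versions, so treat this as a sketch):

import pickle

# The pure-Python pickler keeps its per-type save handlers in a plain dict.
registered_before = set(pickle._Pickler.dispatch)

import dill  # on import, dill adds its own handlers to that registry

registered_after = set(pickle._Pickler.dispatch)
# The difference includes types such as closure cells, which is how
# pyspark's cloudpickle can end up calling dill's save_cell.
print(registered_after - registered_before)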
The issue is similar to the one noted here:
https://github.com/tensorflow/tfx/issues/2090
There are a few things you can try:
(1) Some codebases allow you to replace the serializer. So, if you are able to replace cloudpickle with dill, that may resolve the conflicts. I'm not sure this can be done with pyspark, but there is a pyspark module on serializers, so that is promising (see the sketch below):
Set PySpark Serializer in PySpark Builder
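As a sketch of where that knob lives rather than a confirmed fix: pyspark.SparkContext accepts a serializer argument (e.g. MarshalSerializer from pyspark.serializers). Note that this controls data serialization and may not replace cloudpickle for function/closure pickling, which is where the error above occurs:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# Swap the data serializer; whether this sidesteps cloudpickle for
# closures is the open question.
sc = SparkContext(
    master="local[1]",
    appName="serializer-swap",
    serializer=MarshalSerializer(),
)
print(sc.parallelize([1, 2, 3, 4, 5, 6, 7]).stdev())
sc.stop()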
(2) dill provides a mechanism to help mitigate some of the conflicts in the pickle registry. If you use dill.extend(False) before using cloudpickle, then dill.extend(True) before using dill, it may clear up the issue you are seeing.
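For example, in the test above (reusing the asker's start_session fixture), the toggling might look like this:

import dill

def test_simple_rdd(start_session):
    dill.extend(False)  # remove dill's handlers from the pickle registry
    try:
        rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
        assert rdd.stdev() == 2.0
    finally:
        dill.extend(True)  # restore them for any code that needs dill itself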
Upvotes: 2