lazarea

Reputation: 1349

Installing Python package across Spark executors - not finding python package, raising ModuleNotFoundError

I have a question regarding the correct way to install new packages on Spark worker nodes, using Databricks and MLflow.

What I currently have are the following:

  1. a training script (using cv2, i.e. the opencv-python library), which logs the tuned ML model, together with its dependencies, to the MLflow Model Registry
  2. an inference script which reads the logged ML model, together with the saved conda environment, as a spark_udf
  3. an installation step which reads the conda environment and installs all packages at the required versions, via pip install (wrapped in a subprocess.call)
  4. and then the actual prediction, where the spark_udf is called on new data (a rough sketch of steps 2-4 follows right after this list).
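Roughly, the inference side (steps 2-4) looks like the sketch below; the model name, registry URI and input table are hypothetical placeholders, and mlflow.pyfunc.get_model_dependencies stands in for however the conda environment is actually read in my script:

import subprocess

import mlflow.pyfunc

# 2. Load the registered model (logged together with its conda environment
#    by the training script) as a Spark UDF.
model_uri = "models:/my_cv_model/Production"   # hypothetical registry URI
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri)

# 3. Resolve the logged dependencies and install them via pip
#    (wrapped in a subprocess call).
requirements_file = mlflow.pyfunc.get_model_dependencies(model_uri)
subprocess.call(["pip", "install", "-r", requirements_file])

# 4. Apply the UDF to new data.
df = spark.read.table("new_data")              # hypothetical input table
predictions = df.withColumn("prediction", predict_udf(*df.columns))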

The last step is what fails with ModuleNotFoundError.

SparkException: Job aborted due to stage failure: Task 8 in stage 95.0 
failed 4 times, most recent failure: Lost task 8.3 in stage 95.0 (TID 577 
(xxx.xxx.xxx.xxx executor 0): org.apache.spark.api.python.PythonException: 
'ModuleNotFoundError: No module named 'cv2''

I have been closely following this article, which seems to cover the same problem:

So it seems that at the moment, despite using spark_udf and the conda environment logged to MLflow, the cv2 module only got installed on my driver node, but not on the worker nodes. If this is true, I now need to programmatically specify these extra dependencies (namely, the cv2 Python module) to the executors/worker nodes.
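(For reference, depending on the MLflow version, spark_udf also exposes an env_manager argument that is meant to recreate the logged environment on the executors; whether it is available in a given Databricks runtime varies, so the following is only a sketch assuming a recent MLflow:)

import mlflow.pyfunc

# Sketch only: ask spark_udf to rebuild the model's logged conda environment
# on the executors instead of relying on the local Python environment.
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/my_cv_model/Production",   # hypothetical registry URI
    env_manager="conda",
)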

So what I did was import cv2 in my inference script, retrieve the path of cv2's __init__ file, and add it to the Spark context, similarly to how it is done for the arbitrary "A.py" file in the blog post above:

import os
import cv2

spark.sparkContext.addFile(os.path.abspath(cv2.__file__))

This doesn't seem to make any difference, though. I assume the reason is, in part, that I want to make the entire cv2 library accessible to the worker nodes, not just a single __init__.py file; however, the above only ships the __init__.py. I'm certain that adding every file of every cv2 submodule one by one is not the way to go either, but I haven't been able to figure out how I could achieve this easily with a command similar to the addFile() above.
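For completeness, the "ship the whole package" variant of this idea would look roughly like the sketch below, since addPyFile() accepts .py, .zip and .egg files and a directory has to be archived first; it still doesn't help with cv2, because the package ships platform-specific compiled extensions rather than pure Python files:

import os
import shutil
import cv2

# Sketch only: archive the entire installed cv2 package and distribute the
# zip to the executors. This still fails for cv2 in practice, since the
# package contains compiled native extensions, not just Python modules.
pkg_dir = os.path.dirname(cv2.__file__)
archive = shutil.make_archive(
    "/tmp/cv2_pkg",                      # hypothetical output path
    "zip",
    root_dir=os.path.dirname(pkg_dir),   # the site-packages directory
    base_dir="cv2",                      # include the package as cv2/...
)
spark.sparkContext.addPyFile(archive)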

Similarly, I also tried the other option, addPyFile(), pointing it to the cv2 package's root directory (the parent of __init__.py):

import os
import cv2

spark.sparkContext.addPyFile(os.path.dirname(cv2.__file__))

but this also didn't help, and I still got stuck with the same error. Furthermore, I would like this process to be automatic, i.e. not having to manually set module paths in the inference code.
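As a quick sanity check that the module really is missing on the workers (and not just misconfigured on the driver), a small diagnostic job like the following can be run; it only uses standard Spark and importlib APIs:

import importlib.util

# Returns True on an executor if cv2 can be imported there.
def cv2_available(_):
    return importlib.util.find_spec("cv2") is not None

# Run the check across 8 partitions so it executes on the worker nodes.
print(spark.sparkContext.parallelize(range(8), 8).map(cv2_available).collect())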

Similar posts I came across:

Upvotes: 2

Views: 2136

Answers (1)

Habib Karbasian

Reputation: 676

I have encountered the same issue in my project, where my code requires some specific packages on all of the executor nodes, not just on the head node. This is a very tricky issue with DBX (Databricks).

The virtual environment functionality in DBX is notebook-scoped, meaning that whatever package you would like to use in a notebook, you should install it at the beginning of that notebook. But there is a catch here regarding the LTS version of the cluster (a minimal example follows the list below):

  1. Databricks Runtime 11.3 LTS and above: %pip, %sh pip, and !pip to install libraries

  2. Databricks Runtime 10.4 LTS: %pip to install libraries

  3. Databricks Runtime 9.1 LTS: there is no notebook-scoped functionality in this version; instead it offers cluster-level functionality, meaning that you have to install each package on your cluster individually (e.g. pandas==1.4.2) or build a wheel and install the wheel. This is the most cumbersome situation, and I highly recommend upgrading to 11.3 or above to get the notebook-scoped functionality.
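For example, on Databricks Runtime 11.3 LTS and above, a notebook-scoped install at the top of your inference notebook would look like this (shown with opencv-python as an assumed stand-in for whatever the logged environment actually pins; adjust the package and version accordingly):

%pip install opencv-python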

You can learn more about it here.

Upvotes: 0
