Reputation: 103
I am trying to use the elasticsearch package in a Dataproc Serverless PySpark job. I am facing an issue only with this package in Dataproc Serverless.
import os
print("Current dir:", os.getcwd())
print("Current dir list:", os.listdir('.'))
import pandas
import statsmodels
import platform
import numpy
print("Python version:", platform.python_version())
print("Pandas version:", pandas.__version__)
print("statsmodel version:", statsmodels.__version__ )
print("Numpy version:", numpy.__version__)
#import elasticsearch as es
from elasticsearch import Elasticsearch as es
print("elasticsearch version:", es.__version__ )
Below is the output of this code.
Current dir: /tmp/srvls-batch-7554fe27-4044-4341-ae79-ffe9488ea385
Current dir list: ['pyspark_venv.tar.gz', '.test_sls.py.crc', 'test_sls.py']
Python version: 3.9.15
Pandas version: 1.4.4
statsmodel version: 0.13.5
Numpy version: 1.21.6
Traceback (most recent call last):
File "/tmp/srvls-batch-7554fe27-4044-4341-ae79-ffe9488ea385/test_sls.py", line 16, in <module>
from elasticsearch import Elasticsearch as es
ModuleNotFoundError: No module named 'elasticsearch'
I followed the steps below to set up the venv for this job:
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv
and used the --archives option when submitting the job. Can anyone please correct me if I am missing anything? Thanks in advance.
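For reference, I built the archive and submitted the batch roughly like this (the region and staging bucket below are placeholders, not my exact values):

# build the venv archive as described in the Spark packaging guide
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install elasticsearch venv-pack
venv-pack -o pyspark_venv.tar.gz

# submit the Dataproc Serverless batch with the archive attached
gcloud dataproc batches submit pyspark test_sls.py \
    --region=us-central1 \
    --deps-bucket=gs://my-staging-bucket \
    --archives=pyspark_venv.tar.gz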
Upvotes: 1
Views: 643
Reputation: 4457
When providing a custom Python environment (via --archives or a custom container image), you need to configure Spark to use it instead of the default one.
To do this, set the PYSPARK_PYTHON environment variable to point to the Python binary in the custom environment.
This can be done in the container image build script, or via Spark properties:
spark.dataproc.driverEnv.PYSPARK_PYTHON=...
spark.executorEnv.PYSPARK_PYTHON=...
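For example, a submission could look roughly like the sketch below. It assumes the # alias syntax for --archives (as used in the Spark packaging guide you linked) is honored, so the archive is extracted into an ./environment directory in the working dir; the region, alias name, and paths are illustrative, adjust them to your setup:

gcloud dataproc batches submit pyspark test_sls.py \
    --region=us-central1 \
    --archives=pyspark_venv.tar.gz#environment \
    --properties=spark.dataproc.driverEnv.PYSPARK_PYTHON=./environment/bin/python,spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python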
Upvotes: 1