Prateek Dubey

Reputation: 239

How to submit PySpark job on Kubernetes (minikube) using spark-submit

I have a PySpark job locally on my laptop. If I want to submit it to my minikube cluster using spark-submit, how do I pass the Python file?

I'm using the following command, but it isn't working:

./spark-submit \
        --master k8s://https://192.168.64.6:8443 \
        --deploy-mode cluster \
        --name amazon-data-review \
        --conf spark.kubernetes.namespace=jupyter \
        --conf spark.executor.instances=1 \
        --conf spark.kubernetes.driver.limit.cores=1 \
        --conf spark.executor.cores=1 \
        --conf spark.executor.memory=500m \
        --conf spark.kubernetes.container.image=prateek/spark-ubuntu-2.4.5 \
        --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
        --conf spark.kubernetes.container.image.pullPolicy=Always \
        --conf spark.kubernetes.container.image.pullSecrets=dockerlogin \
        --conf spark.eventLog.enabled=true \
        --conf spark.eventLog.dir=s3a://prateek/spark-hs/ \
        --conf spark.hadoop.fs.s3a.access.key=xxxxx \
        --conf spark.hadoop.fs.s3a.secret.key=xxxxx \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
        --conf spark.hadoop.fs.s3a.fast.upload=true \
        /Users/prateek/apache-spark/amazon_data_review.py

I'm getting the following error:

python3: can't open file '/Users/prateek/apache-spark/amazon_data_review.py': [Errno 2] No such file or directory

Is it required to keep the file within the Docker image itself? Can't we run it locally by keeping it on the laptop?

Upvotes: 2

Views: 1657

Answers (2)

Alex Sasnouskikh

Reputation: 991

Spark on Kubernetes doesn't support submitting locally stored files with spark-submit.

What you could do to make it work in cluster mode is to build a Spark Docker image based on prateek/spark-ubuntu-2.4.5 with amazon_data_review.py put inside it (e.g. using a Docker COPY amazon_data_review.py /amazon_data_review.py instruction).
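For example, building from the directory that contains the script keeps the COPY source inside the build context. A minimal sketch (the -app tag and the Dockerfile-via-heredoc are just one illustrative way to do it):

# assumption: the script's directory serves as the Docker build context
cd /Users/prateek/apache-spark
cat > Dockerfile <<'EOF'
FROM prateek/spark-ubuntu-2.4.5
# bake the job script into the image so it can be referenced via local://
COPY amazon_data_review.py /amazon_data_review.py
EOF

# the -app tag is an illustrative name; push wherever your cluster pulls from
docker build -t prateek/spark-ubuntu-2.4.5-app .
docker push prateek/spark-ubuntu-2.4.5-app

Then point spark.kubernetes.container.image at the new tag.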

Then just refer to it in the spark-submit command using the local:// scheme, e.g.:

spark-submit \
  --master ... \
  --conf ... \
  ...
  local:///amazon_data_review.py

The alternative is to host the file at a location reachable via http(s)://, hdfs://, or a similar remote scheme.
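For instance, since the job already writes event logs to s3a://, the image presumably carries S3A support, so the script could be uploaded to the same bucket and referenced directly (the s3a://prateek/jobs/ path is only an assumed example):

spark-submit \
  --master k8s://https://192.168.64.6:8443 \
  --deploy-mode cluster \
  --conf ... \
  ...
  s3a://prateek/jobs/amazon_data_review.py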

Upvotes: 4

Prateek Dubey

Reputation: 239

It's solved. Running it in client mode worked: in client mode the driver runs on the laptop itself, where the Python file exists, so the path can be opened.

--deploy-mode client
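For reference, the original command with only the deploy mode changed (a sketch; the remaining flags stay as in the question, and in client mode the executors must be able to reach back to the driver running on the laptop):

./spark-submit \
  --master k8s://https://192.168.64.6:8443 \
  --deploy-mode client \
  --name amazon-data-review \
  --conf ... \
  ...
  /Users/prateek/apache-spark/amazon_data_review.py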

Upvotes: -1
