Reputation: 9
Here's a Docker Compose setup for a distributed Apache Spark environment using Bitnami's Spark image. It includes a Spark master, a Spark Connect server, and two workers. All services are connected via a custom network and use a shared volume for data.
When I submit a job to the Spark Connect server on port 15002 from my local machine, the Spark master distributes the workload to the workers. After some time, I can see the output in the PyCharm console. However, the application keeps running on the master and the workers keep processing it.
To resolve this, I have to kill the application manually. If I then try to run a new application, I get a gRPC error with status code 2, which indicates an unknown error.
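To confirm that the old application really is still registered on the master, here is a minimal diagnostic sketch. It assumes the standalone master's JSON status endpoint at http://localhost:8080/json (availability and field names may differ between Spark versions, so the keys are read defensively):

# Minimal sketch: query the standalone master's JSON status endpoint
# (assumed to be http://localhost:8080/json) and list the applications
# the master still considers active or completed.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/json") as resp:
    state = json.load(resp)

for app in state.get("activeapps", []):
    print("active:", app.get("id"), app.get("name"), app.get("state"))
for app in state.get("completedapps", []):
    print("completed:", app.get("id"), app.get("name"), app.get("state"))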
Docker-compose file:
version: '3.8'
services:
  spark-master:
    image: bitnami/spark
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_PORT=7077
      - SPARK_SUBMIT_OPTIONS=--packages io.delta:delta-spark_2.12:3.2.0
      - SPARK_MASTER_HOST=spark-master
    ports:
      - 8080:8080
      - 7077:7077
    networks:
      - spark-network
    volumes:
      - /mnt/f/Thesis_Docs/Project/spark:/mnt
  spark-connect:
    image: bitnami/spark
    container_name: spark-connect
    environment:
      - SPARK_MODE=driver
      - SPARK_MASTER=spark://spark-master:7077
    ports:
      - 15002:15002
    networks:
      - spark-network
    depends_on:
      - spark-master
    command: ["/bin/bash", "-c", "/opt/bitnami/spark/sbin/start-connect-server.sh --master spark://spark-master:7077 --packages org.apache.spark:spark-connect_2.12:3.5.1"]
    volumes:
      - /mnt/f/Thesis_Docs/Project/spark:/mnt
  spark-worker:
    image: bitnami/spark
    container_name: spark-worker
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - 8081:8081
    depends_on:
      - spark-master
    networks:
      - spark-network
  spark-worker2:
    image: bitnami/spark
    container_name: spark-worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_WEBUI_PORT=8082
    ports:
      - 8082:8082
    depends_on:
      - spark-master
    networks:
      - spark-network
networks:
  spark-network:
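Before submitting anything, a quick sanity check (a minimal sketch; nothing Spark-specific is assumed beyond the 15002 port mapping above) can verify that the Spark Connect port is reachable from the host:

# Minimal sketch: verify from the host that the Spark Connect port
# published by the spark-connect service (15002) accepts TCP connections.
import socket

try:
    with socket.create_connection(("localhost", 15002), timeout=5):
        print("Spark Connect port 15002 is reachable")
except OSError as exc:
    print("Cannot reach Spark Connect on port 15002:", exc)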
Python code running in PyCharm on the host machine:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
# Perform your Spark operations
spark.range(15).show()
# Stop Spark session
spark.stop()
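To make sure the session is released even when an operation fails, the same code can be written with a try/finally (a minimal sketch of the pattern; the hostname and port match the setup above):

# Minimal sketch: same workflow, but spark.stop() is guaranteed to run
# even if an operation raises, so the Connect session is always released.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
try:
    spark.range(15).show()
finally:
    spark.stop()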
Master Web UI on port 8080:
PyCharm output window:
I get gRPC errors with status code 2. Once I kill the application running on the cluster through the Web UI, I can't run any other application on the same port:
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = ""
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-07-11T09:52:19.743877+00:00", grpc_status:2, grpc_message:""}"
The expected behavior is for the Spark master to coordinate the distribution of tasks to the Spark workers, and for the Spark Connect server to allow task submission from my local machine. However, there are a few issues to address so that the system functions correctly and the gRPC error with status code 2 does not occur.
Also, once the application completes and its result appears in the PyCharm console, it should be listed among the completed applications in the Web UI as well. I should then be able to submit a new application on the same port from any IDE, as in the sketch below.
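For clarity, this is a minimal sketch of what I expect to work: two applications submitted back to back against the same Connect endpoint, with the second session starting cleanly after the first one is stopped:

# Minimal sketch of the expected workflow: run two applications one
# after another against the same Spark Connect endpoint.
from pyspark.sql import SparkSession

# First application
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(15).show()
spark.stop()

# Second application on the same port; this is the step that currently
# fails for me with the gRPC status code 2 error shown above.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(30).show()
spark.stop()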
Upvotes: 0
Views: 247