pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

Question

I setup Airflow and Spark standalone cluster on docker-compose. Airflow run spark-submit tasks via spark client mode, which are submitted directly to spark master. However when I execute spark-submit task, the task got stuck.

Spark-submit Command:

spark-submit --verbose --master spark:7077 --name dummy_sql_spark_job ${AIRFLOW_HOME}/dags/spark/spark_sql.py

What i see from spark-submit driver logs:

22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/1 is now EXITED (Command exited with code 1)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Executor app-20220104070012-0011/1 removed: Command exited with code 1
22/01/04 07:02:19 INFO BlockManagerMaster: Removal of executor 1 requested
22/01/04 07:02:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/01/04 07:02:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220104070012-0011/5 on worker-20220104061702-172.27.0.9-38453 (172.27.0.9:38453) with 1 core(s)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Granted executor ID app-20220104070012-0011/5 on hostPort 172.27.0.9:38453 with 1 core(s), 1024.0 MiB RAM
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/5 is now RUNNING
22/01/04 07:02:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

What i see from one of the spark workers:

spark-worker-1_1  | 22/01/04 07:02:18 INFO SecurityManager: Changing modify acls groups to:
spark-worker-1_1  | 22/01/04 07:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-1_1  | 22/01/04 07:02:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=5001" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.27.0.6:5001" "--executor-id" "3" "--hostname" "172.27.0.11" "--cores" "1" "--app-id" "app-20220104070012-0011" "--worker-url" "spark://Worker@172.27.0.11:35093"

Versions:

Airflow image: apache/airflow:2.2.3

Spark driver version: 3.1.2

Spark server: 3.2.0

Network

All containers airflow-scheduler, airflow-webserver, spark-master, spark-worker-n connected to same external network.

spark-driver is installed under airflow containers (scheduler, webserver), because corresponding dags and tasks are executed by airflow-scheduler.

UPDATE

After replacing driver spark version to match the master's one 3.2.0, the issue get disappeared. So it means, that in my particular case the issue was not due to connectivity between different spark actors (driver, master, worker/executor), but due to version mismatch. For some reason spark workers does not log corresponding error, which is misleading.

Rustem K · Accepted Answer

Most of the threads was pointing to connectivity issues. However in my case issue was due to mismatch of spark's driver vs master/worker version.

After replacing driver spark version to match the master's one 3.2.0, as well as ensure the same python version both on driver and executor sides (3.9.10) the issue get disappeared. So it means, that in my particular case the issue was not due to connectivity between different spark actors (driver, master, worker/executor), but due to version mismatch. For some reason spark workers does not log corresponding error, which is misleading.

pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

Versions:

Network

UPDATE

Answers (1)

Related Questions