Reputation: 63
I have a standalone Spark cluster with a master and one worker (for now). It runs on Azure Container Apps - one app for the master, another for the worker - and both apps are in the same namespace.
The cluster runs a single application (it reads from Kafka, applies a simple transformation, and writes to a Delta table). When I start the application I see both the executor and the driver in the dashboard (1 core, 2 GB RAM each). After 5 minutes the executor is killed, leaving only the driver alive. I have tried multiple times - it is always exactly 5 minutes.
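For context, the job is essentially the following (a minimal sketch - the broker, topic, and paths are placeholders, not my real values):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EventsStream")
  .getOrCreate()

// Read the Kafka topic as a stream (broker and topic are placeholders).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// The "simple transformation": just decode the message value here.
val transformed = events.selectExpr("CAST(value AS STRING) AS value")

// Write to a Delta table (checkpoint and table paths are placeholders).
transformed.writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/events")
  .start("/mnt/delta/events")
  .awaitTermination()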
Master log:
2025-03-07T07:18:03.8025281Z stderr F 25/03/07 07:18:03 INFO Master: Registering worker 100.100.206.177:65000 with 10 cores, 7.0 GiB RAM
2025-03-07T07:18:08.6740120Z stderr F 25/03/07 07:18:08 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
2025-03-07T07:18:08.6744812Z stderr F 25/03/07 07:18:08 INFO Master: Launching driver driver-20250307071808-0002 on worker worker-20250307071803-100.100.206.177-65000
2025-03-07T07:18:11.6502394Z stderr F 25/03/07 07:18:11 INFO Master: Registering app AppName
2025-03-07T07:18:11.6509960Z stderr F 25/03/07 07:18:11 INFO Master: Registered app AppName with ID app-20250307071811-0002
2025-03-07T07:18:11.6516823Z stderr F 25/03/07 07:18:11 INFO Master: Start scheduling for app app-20250307071811-0002 with rpId: 0
2025-03-07T07:18:11.6520393Z stderr F 25/03/07 07:18:11 INFO Master: Launching executor app-20250307071811-0002/0 on worker worker-20250307071803-100.100.206.177-65000
2025-03-07T07:18:11.7802035Z stderr F 25/03/07 07:18:11 INFO Master: Start scheduling for app app-20250307071811-0002 with rpId: 0
2025-03-07T07:18:14.1149857Z stderr F 25/03/07 07:18:14 INFO Master: 100.100.0.179:40548 got disassociated, removing it.
2025-03-07T07:18:14.1151561Z stderr F 25/03/07 07:18:14 INFO Master: 100.100.206.177:40175 got disassociated, removing it.
2025-03-07T07:23:11.7824346Z stderr F 25/03/07 07:23:11 INFO Master: 100.100.0.36:58476 got disassociated, removing it.
2025-03-07T07:23:11.7825329Z stderr F 25/03/07 07:23:11 INFO Master: ca-app-worker-bdso2eczojhja--confinsubmit-ff59f4b75-v8lm4:38129 got disassociated, removing it.
2025-03-07T07:23:11.7825770Z stderr F 25/03/07 07:23:11 INFO Master: Removing app app-20250307071811-0002
2025-03-07T07:23:11.8980486Z stderr F 25/03/07 07:23:11 WARN Master: Got status update for unknown executor app-20250307071811-0002/0
Executor log:
2025-03-07T07:18:08.5551767Z stderr F 25/03/07 07:18:08 INFO Utils: Successfully started service 'driverClient' on port 40175.
2025-03-07T07:18:08.6040221Z stderr F 25/03/07 07:18:08 INFO TransportClientFactory: Successfully created connection to ca-app-master-bdso2eczojhja/100.100.229.245:7077 after 22 ms (0 ms spent in bootstraps)
2025-03-07T07:18:08.6720836Z stderr F 25/03/07 07:18:08 INFO ClientEndpoint: ... waiting before polling master for driver state
2025-03-07T07:18:08.6896218Z stderr F 25/03/07 07:18:08 INFO ClientEndpoint: Driver successfully submitted as driver-20250307071808-0002
2025-03-07T07:18:08.7077359Z stderr F 25/03/07 07:18:08 INFO Worker: Asked to launch driver driver-20250307071808-0002
2025-03-07T07:18:08.7390778Z stderr F 25/03/07 07:18:08 INFO DriverRunner: Copying user jar file:/opt/spark/work-dir/sparkjobs-0.1-all.jar to /opt/spark/work/driver-20250307071808-0002/sparkjobs-0.1-all.jar
2025-03-07T07:18:08.7532736Z stderr F 25/03/07 07:18:08 INFO Utils: Copying /opt/spark/work-dir/sparkjobs-0.1-all.jar to /opt/spark/work/driver-20250307071808-0002/sparkjobs-0.1-all.jar
2025-03-07T07:18:08.9767253Z stderr F 25/03/07 07:18:08 INFO DriverRunner: Launch Command: "/opt/java/openjdk/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*:/etc/hadoop/conf" "-Xmx2048M" "-Dspark.dynamicAllocation.enabled=false" "-Dspark.master=spark://ca-app-master-bdso2eczojhja:7077" "-Dspark.driver.memory=2G" "-Dspark.network.timeout=600s" "-Dspark.submit.deployMode=cluster" "-Dspark.shuffle.compress=true" "-Dspark.executor.memory=2G" "-Dspark.app.name=dk.name.sparkjobs.EventsStream" "-Dspark.cores.max=1" "-Dspark.driver.supervise=false" "-Dspark.jars=file:/opt/spark/work-dir/sparkjobs-0.1-all.jar" "-Dspark.submit.pyFiles=" "-Dspark.executor.cores=1" "-Dspark.app.submitTime=1741331888113" "-Dspark.rpc.askTimeout=10s" "org.apache.spark.deploy.worker.DriverWrapper" "spark://[email protected]:65000" "/opt/spark/work/driver-20250307071808-0002/sparkjobs-0.1-all.jar" "dk.name.sparkjobs.EventsStream"
2025-03-07T07:18:11.7022027Z stderr F 25/03/07 07:18:11 INFO Worker: Asked to launch executor app-20250307071811-0002/0 for AppName
2025-03-07T07:18:11.7179607Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing view acls to: spark
2025-03-07T07:18:11.7185133Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing modify acls to: spark
2025-03-07T07:18:11.7188098Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing view acls groups to:
2025-03-07T07:18:11.7189881Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing modify acls groups to:
2025-03-07T07:18:11.7192879Z stderr F 25/03/07 07:18:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: spark; groups with view permissions: EMPTY; users with modify permissions: spark; groups with modify permissions: EMPTY
2025-03-07T07:18:11.7391135Z stderr F 25/03/07 07:18:11 INFO ExecutorRunner: Launch command: "/opt/java/openjdk/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*:/etc/hadoop/conf" "-Xmx2048M" "-Dspark.network.timeout=600s" "-Dspark.driver.port=38129" "-Dspark.rpc.askTimeout=10s" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@ca-app-worker-bdso2eczojhja--confinsubmit-ff59f4b75-v8lm4:38129" "--executor-id" "0" "--hostname" "100.100.206.177" "--cores" "1" "--app-id" "app-20250307071811-0002" "--worker-url" "spark://[email protected]:65000" "--resourceProfileId" "0"
2025-03-07T07:18:13.7192983Z stderr F 25/03/07 07:18:13 INFO ClientEndpoint: State of driver-20250307071808-0002 is RUNNING
2025-03-07T07:18:13.7198964Z stderr F 25/03/07 07:18:13 INFO ClientEndpoint: Driver running on 100.100.206.177:65000 (worker-20250307071803-100.100.206.177-65000)
2025-03-07T07:18:13.7203498Z stderr F 25/03/07 07:18:13 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
2025-03-07T07:18:13.7576048Z stderr F 25/03/07 07:18:13 INFO ShutdownHookManager: Shutdown hook called
2025-03-07T07:18:13.7585522Z stderr F 25/03/07 07:18:13 INFO ShutdownHookManager: Deleting directory /tmp/spark-c4eef00a-1dc8-4168-bc51-3ca6e32c4525
2025-03-07T07:23:11.7877695Z stderr F 25/03/07 07:23:11 INFO Worker: Asked to kill executor app-20250307071811-0002/0
2025-03-07T07:23:11.7884040Z stderr F 25/03/07 07:23:11 INFO ExecutorRunner: Runner thread for executor app-20250307071811-0002/0 interrupted
2025-03-07T07:23:11.7893527Z stderr F 25/03/07 07:23:11 INFO ExecutorRunner: Killing process!
2025-03-07T07:23:11.8963907Z stderr F 25/03/07 07:23:11 INFO Worker: Executor app-20250307071811-0002/0 finished with state KILLED exitStatus 143
2025-03-07T07:23:11.8972549Z stderr F 25/03/07 07:23:11 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
2025-03-07T07:23:11.8976360Z stderr F 25/03/07 07:23:11 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20250307071811-0002, execId=0)
2025-03-07T07:23:11.8995390Z stderr F 25/03/07 07:23:11 INFO Worker: Cleaning up local directories for application app-20250307071811-0002
2025-03-07T07:23:11.8998084Z stderr F 25/03/07 07:23:11 INFO ExternalShuffleBlockResolver: Application app-20250307071811-0002 removed, cleanupLocalDirs = true
I have tried with spark.network.timeout=600s and spark.dynamicAllocation.enabled=false - neither of which changed anything. During the 5 minutes the job works fine, reading and writing data. After the 5 minutes I still see the driver and the worker in the dashboard.
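For reference, I pass both settings as --conf flags to spark-submit (they appear as -D options in the driver launch command above); setting them in code instead would look roughly like this sketch:
import org.apache.spark.sql.SparkSession

// Sketch: the same settings applied programmatically.
val spark = SparkSession.builder()
  .appName("EventsStream")
  .config("spark.network.timeout", "600s")            // raise the default network timeouts
  .config("spark.dynamicAllocation.enabled", "false") // fixed executor count, no scale-down
  .getOrCreate()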
Any ideas why the executor is killed?
Edit: If I run the application in client mode rather than cluster mode, it works fine.
Upvotes: 0
Views: 34
Reputation: 2068
Your executor was killed because it received a SIGTERM from your driver and shut down gracefully (exit code 143). There are many potential causes for exit code 143; in your case I believe it is a memory / GC issue. You can validate this by checking the Spark UI.
I'm not sure how complicated your transformation is, but if you still have resources in your cluster, increase your executor memory. If not, reduce maxOffsetsPerTrigger.
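maxOffsetsPerTrigger is an option on the Structured Streaming Kafka source; it caps how many records each micro-batch reads, so a single trigger cannot pull more data than the executor can hold. As a sketch (broker, topic, and the limit are illustrative):
// Assumes an existing SparkSession named spark.
// Cap the records read per micro-batch so each trigger fits in executor memory.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("maxOffsetsPerTrigger", "10000")          // illustrative value - tune for your throughput
  .load()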
Upvotes: 0