Reputation: 406
I am submitting a Spark job to Livy through an AWS Lambda function. The job runs to the end of the driver program but then does not shut down.
If spark.stop() or sc.stop() is added to the end of the driver program, the Spark job finishes on the YARN resource manager and Livy reports success. However, there is still a Livy process running on the master node which takes around 1.5 GB of memory. If many jobs are submitted, this eventually uses and holds all of the master node's memory.
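For reference, the submission from the Lambda function is just a POST to Livy's /batches REST endpoint; a minimal sketch is below (the Livy URL, S3 path of the driver script, and conf values are placeholders, not our real job):

```python
# Minimal sketch of the Lambda handler that submits the batch to Livy.
# The Livy host, S3 path, and conf values are placeholders.
import json
import urllib.request

LIVY_URL = "http://emr-master.internal:8998/batches"  # hypothetical endpoint

def lambda_handler(event, context):
    payload = {
        "file": "s3://my-bucket/jobs/report_job.py",   # hypothetical driver script
        "conf": {
            "spark.submit.deployMode": "cluster",
        },
    }
    req = urllib.request.Request(
        LIVY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # contains the Livy batch id and state
```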
The job:
Pulls records from a hive table
Collects these records on the master node and then writes them to a PDF file using Apache PDFBox
Uploads the resulting PDF to S3
Directly running spark-submit on the cluster produces the same results; however, if I Ctrl+C whilst the spark-submit job is running, the process on the master node is ended.
We expect the job to finish by itself when it reaches the end of the driver program. Failing that, the shutdown hook should be called when spark.stop() is called.
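For context, the driver is laid out roughly as sketched below; the table and bucket names are placeholders, and the Apache PDFBox rendering code is hidden behind a stand-in write_pdf helper:

```python
import boto3
from pyspark.sql import SparkSession

def write_pdf(rows, path="/tmp/report.pdf"):
    # Stand-in for the Apache PDFBox rendering code, which is not shown here.
    with open(path, "wb") as f:
        f.write(b"%PDF-1.4\n")  # placeholder content only
    return path

def main():
    spark = SparkSession.builder.appName("report_job").enableHiveSupport().getOrCreate()

    # 1. Pull records from a Hive table (table name is a placeholder)
    # 2. Collect them on the driver
    rows = spark.sql("SELECT * FROM reports.source_table").collect()

    # 3. Render the PDF and upload it to S3 (bucket/key are placeholders)
    pdf_path = write_pdf(rows)
    boto3.client("s3").upload_file(pdf_path, "my-bucket", "reports/output.pdf")

    # Without this explicit stop, the YARN application never reaches FINISHED
    # and Livy keeps reporting the batch as running.
    spark.stop()

if __name__ == "__main__":
    main()
```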
Upvotes: 0
Views: 833
Reputation: 838
Have you tried enabling this flag in the Spark configuration? spark.yarn.submit.waitAppCompletion=false
What I observed is that Livy issues a spark-submit command, and the above flag makes sure that the command completes as soon as YARN creates an applicationId for the application, instead of waiting for the application to finish.
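If you go this route with Livy, the flag can be passed in the conf map of the batch request; a sketch using the same hypothetical payload shape as in the question (the file path is a placeholder):

```python
payload = {
    "file": "s3://my-bucket/jobs/report_job.py",   # placeholder driver script
    "conf": {
        # Let spark-submit return as soon as the YARN applicationId exists,
        # rather than blocking until the application completes.
        "spark.yarn.submit.waitAppCompletion": "false",
    },
}
```

When running spark-submit directly on the cluster, the equivalent is adding --conf spark.yarn.submit.waitAppCompletion=false to the command line.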
Upvotes: 1