Mateusz Dymczyk

Reputation: 15141

Total number of jobs in a Spark App

I already saw this question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?

I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
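Something like this minimal listener sketch is what I have in mind for the "finished" side (the class name `JobCountListener` and the wiring are just for illustration, assuming a plain `SparkContext`); it can only count jobs as they arrive, it never knows the eventual total:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Counts jobs as they start and finish; it has no view of jobs
// that haven't been submitted yet.
class JobCountListener extends SparkListener {
  val started  = new AtomicInteger(0)
  val finished = new AtomicInteger(0)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    started.incrementAndGet()

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    finished.incrementAndGet()
}

// Registration, given an existing SparkContext `sc`:
// val listener = new JobCountListener
// sc.addSparkListener(listener)
```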

I want to track the progress of the whole app, which creates quite a few jobs, but I can't find the total anywhere.

@Edit: I know there's a REST endpoint for getting all the jobs in an app but:

  1. I would prefer not to use REST but to get it in the app itself (Spark is running on AWS EMR/YARN - getting the address is probably doable, but I'd prefer not to)
  2. that REST endpoint only seems to return jobs that are running/finished/failed, so not the total number of jobs.

Upvotes: 4

Views: 4139

Answers (1)

Mateusz Dymczyk

Reputation: 15141

After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark does such an analysis upfront. Because jobs are submitted by each action independently, Spark never has a big picture of all the jobs from the start.

This kind of makes sense because of how Spark divides work into:

  • jobs - which are started whenever the code running on the driver node encounters an action (e.g. collect(), take()) and which compute a value and return it to the driver
  • stages - which are composed of sequences of tasks between which no data shuffling is required
  • tasks - computations of the same type which can run in parallel on worker nodes

So while Spark does need to know the stages and tasks of a single job upfront in order to build that job's DAG, it doesn't need a DAG of jobs; it can simply create them "as it goes".
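For example, a small sketch (assuming an existing SparkContext `sc`) of how each action submits its own, independent job - until the driver code actually reaches a line, Spark simply doesn't know that job exists:

```scala
val rdd = sc.parallelize(1 to 1000).map(_ * 2)

rdd.count()         // action -> the 1st job is planned (stages/tasks) and run here
rdd.reduce(_ + _)   // action -> the 2nd job, unknown to Spark until this line runs
rdd.collect()       // action -> the 3rd job
```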

Upvotes: 3
