Reputation: 2238
I am building a process for launching user-built queries (business rules) using Scala-Spark/SQL. One of the requirements is that if a SQL performs slower than expected (every rule has an expected-performance attribute, in seconds), I need to flag it as such for future reference, as well as kill the long-running (slow) process/job.
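For concreteness, each rule looks roughly like this (the field names here are only illustrative):

```scala
// Illustrative only: a user-built rule carries its SQL plus an expected runtime.
case class BusinessRule(
  id: String,
  sql: String,            // the user-built query
  expectedSeconds: Long   // expected performance; exceeding it should flag the rule as slow
)
```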
So far I am thinking of the following approach -
I am concerned that I am fiddling with the distributed nature of the job. Another concern: for my "job" (running that query), Spark will internally launch an unknown number of tasks across nodes, so how will the timing process work, and what kind of actual performance will be reported back to my program?
Suggestions, please.
Upvotes: 2
Views: 751
Reputation: 1161
I suggest a different approach: build a streaming or scheduled batch application that updates state in a database as new input data arrives, and then provide a REST API that lets clients read that state for whatever query range they need. From my experience, allowing clients to launch a series of Spark jobs exposes you to a huge operational overhead: you end up managing their performance and volume, and hence their effect on cluster resources. It is easier to tune, monitor and productionise your own queries (partitions, cores, number of executors, optimal cluster resources) and to manage the query REST API.
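To make that first option concrete, here is a rough sketch of such a scheduled batch (the rules list, JDBC URL and table names are placeholders; in practice the rules would come from your metadata store):

```scala
import org.apache.spark.sql.SparkSession

object RuleStateBatch {
  // Illustrative rule shape: user SQL plus its expected runtime in seconds.
  case class Rule(id: String, sql: String, expectedSeconds: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("business-rules-batch").getOrCreate()

    // Placeholder: load these from your rules metadata store.
    val rules = Seq(Rule("r1", "SELECT region, count(*) AS cnt FROM sales GROUP BY region", 60))

    rules.foreach { rule =>
      val start  = System.nanoTime()
      val result = spark.sql(rule.sql)

      // Persist the rule's latest state; the REST API only ever reads this table.
      result.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/rules") // placeholder, add credentials
        .option("dbtable", s"rule_state_${rule.id}")
        .mode("overwrite")
        .save()

      val elapsedSec = (System.nanoTime() - start) / 1e9
      if (elapsedSec > rule.expectedSeconds)
        println(s"Flag rule ${rule.id} as slow: ${elapsedSec}s (expected ${rule.expectedSeconds}s)")
    }

    spark.stop()
  }
}
```

The REST API then only queries the database, so client latency no longer depends on Spark at all.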
If that is not suitable for you, build a REST API that lets users launch their own Spark job per query (examples: the Spark hidden REST API, Spark Job Server), and then build an application that monitors the Spark UI and kills and relaunches any job that runs too long; you can use the "kill spark job via spark ui" script as an example.

Your planned approach might be very hard to execute, because Spark launches multiple jobs (futures) internally, many of them lazily evaluated, and timing each stage of execution is pretty hard. Perhaps you can use a Future to launch the Spark job per client query and monitor its length?
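Something along these lines, assuming you run each rule from your driver program (the sink path is a placeholder, and the expected time comes from the rule's attribute):

```scala
import java.util.UUID
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

object RuleRunner {
  // Runs one rule's SQL, cancels it if it exceeds the expected time,
  // and returns false so the caller can flag the rule as slow.
  def runWithTimeout(spark: SparkSession, ruleId: String, sql: String,
                     expectedSeconds: Long): Boolean = {
    val groupId = s"rule-$ruleId-${UUID.randomUUID()}"

    val work = Future {
      // Tag every job submitted from this thread so they can be cancelled together.
      spark.sparkContext.setJobGroup(groupId, s"business rule $ruleId", interruptOnCancel = true)
      spark.sql(sql).write.mode("overwrite").parquet(s"/tmp/rule-results/$ruleId") // placeholder sink
    }

    try {
      Await.result(work, expectedSeconds.seconds) // blocks only the monitoring thread
      true
    } catch {
      case _: TimeoutException =>
        spark.sparkContext.cancelJobGroup(groupId) // kills all jobs of the slow query
        false
    }
  }
}
```

Note that setJobGroup is thread-local, so it has to be called inside the Future, on the thread that actually submits the Spark jobs.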
I hope it helps.
Upvotes: 1