Reputation: 863
I am running a Spark direct stream from Kafka, where I need to run many concurrent jobs in order to process all the data in time. In Spark you can set spark.streaming.concurrentJobs
to the number of concurrent jobs you want to run, as in the sketch below.
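A minimal sketch of where the property is set, assuming a Spark Streaming application with a Kafka direct stream; the application name, the value 4, and the 10-second batch interval are placeholders to be tuned:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark.streaming.concurrentJobs must be set before the StreamingContext
// is created; "4" is only a placeholder value to be tuned.
val conf = new SparkConf()
  .setAppName("kafka-direct-stream")
  .set("spark.streaming.concurrentJobs", "4")

val ssc = new StreamingContext(conf, Seconds(10))
// ... create the Kafka direct stream and the processing logic here ...
ssc.start()
ssc.awaitTermination()
```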
What I want to know is a logical way to determine how many concurrent jobs I can run in my environment. For confidentiality reasons I cannot share my company's hardware specs, but I would like to know which specs are relevant in determining that limit, and why.
Of course, the alternative is to keep increasing the value, testing, and adjusting based on the results, but I would prefer a more principled approach: I want to actually understand what determines that limit and why.
Upvotes: 0
Views: 621
Reputation: 13515
Testing different numbers of concurrent jobs and measuring the overall execution time is the most reliable method. However, I suppose the best number is roughly equal to the value returned by Runtime.getRuntime().availableProcessors().
So my advice is to start with that number of available processors, then increase and decrease it by 1, 2, and 3. Plot execution time against the number of jobs, and the chart will show you the optimal number of jobs. A rough sketch of that sweep follows.
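This is only a skeleton of the measurement loop; runJobWithConcurrency is a hypothetical helper standing in for however you restart your streaming job with spark.streaming.concurrentJobs set to the given value and wait for it to finish a fixed amount of work:

```scala
// Start from the number of available processors and try values a few
// steps below and above it, as described above.
val cores = Runtime.getRuntime.availableProcessors()
val candidates = ((cores - 3) to (cores + 3)).filter(_ >= 1)

val timings = candidates.map { n =>
  val start = System.nanoTime()
  runJobWithConcurrency(n) // hypothetical: run your workload with n concurrent jobs
  val elapsedSec = (System.nanoTime() - start) / 1e9
  n -> elapsedSec
}

// Print a simple table; plotting these points shows the optimum.
timings.foreach { case (n, t) => println(f"jobs=$n%2d  time=$t%8.1f s") }
```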
Upvotes: 1