Reputation: 58
I am new to Spark, so any suggestion, whether about relevant tools or a recommended design change for my use case, would be appreciated.
My current situation is that I have a few million independent Spark jobs that do not take very long to run (a couple of seconds on average), and I'm using Livy to submit them in batch mode. The issue is that the time it takes to initialize a Spark context for each job is far longer than the time it takes to run the job itself. So my idea is to initialize a fixed pool of Spark contexts and use them to run all of these jobs, instead of creating a new context every time a job is run.
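To make the current setup concrete, the per-job submission looks roughly like the sketch below; the Livy endpoint, jar path, class name, and job arguments are placeholders for whatever the real deployment uses:

```python
# Rough sketch of the current per-job submission via Livy's batch API.
# Endpoint, artifact, and class name below are placeholders.
import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint


def submit_batch(job_args):
    # Every POST to /batches starts a brand-new Spark application,
    # so each job pays the full driver/context start-up cost.
    payload = {
        "file": "hdfs:///apps/my-job.jar",   # hypothetical job artifact
        "className": "com.example.MyJob",    # hypothetical entry point
        "args": job_args,
    }
    resp = requests.post(f"{LIVY_URL}/batches", json=payload)
    resp.raise_for_status()
    return resp.json()["id"]
```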
The thing is, I'm completely new to Spark and have no idea whether this is possible or a good idea to pursue. I tried using a few Livy sessions, but multiple statements cannot be executed simultaneously within a single Livy session, so I'm stuck.
Upvotes: 1
Views: 321
Reputation: 6999
Each Spark application has its own associated driver, which runs in a dedicated process. Within that process the SparkContext is a singleton.
Depending on what your source is, you can use one of the following options:
So in short, you need to find a way to minimize the number of applications you run.
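For example, if the per-job work can be expressed against a single SparkSession, something along these lines keeps one application alive and pushes all the small jobs through it. The paths, the `run_one` function, the job list, and the thread count are assumptions for illustration, not part of your setup:

```python
# Hedged sketch: run many small units of work inside ONE Spark application
# instead of one application per job, so the context start-up cost is paid once.
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("many-small-jobs")
    # FAIR scheduling lets jobs submitted from different driver threads
    # share the executors instead of queuing strictly one after another.
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)


def run_one(job_spec):
    # Placeholder for whatever a single "job" does today:
    # read a small input, transform it, write the result.
    df = spark.read.parquet(job_spec["in_path"])
    (df.groupBy("key").count()
       .write.mode("overwrite")
       .parquet(job_spec["out_path"]))


# Hypothetical job descriptions; in practice these would be the
# few million job specs, fetched from wherever they are stored.
job_specs = [
    {"in_path": "hdfs:///data/in/0001", "out_path": "hdfs:///data/out/0001"},
    {"in_path": "hdfs:///data/in/0002", "out_path": "hdfs:///data/out/0002"},
]

# Submit the small jobs from a pool of driver threads so the single
# SparkContext stays busy.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_one, job_specs))

spark.stop()
```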
Upvotes: 1