Koko191

Reputation: 58

Is it possible to run many Spark jobs in parallel using a fixed pool of Spark contexts?

I am new to Spark so any suggestion, either about relevant tools or recommended design change for my use case, would be appreciated.

My current situation is that I have a few million independent Spark jobs that do not take very long to run (a couple of seconds on average), and I'm using Livy to submit them in batch mode. The issue is that the time it takes to initialize a Spark context for each job is far longer than the time it takes to run the job itself. So my idea is to initialize a fixed pool of Spark contexts and use them to run all these jobs, instead of creating a new context every time a job is run.
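
Roughly speaking, the submission loop looks something like the sketch below (the Livy URL, job script, and arguments are just placeholders, not my real values). Every POST to Livy's /batches endpoint starts a brand-new Spark application, which is where all the time goes:

    # Minimal sketch of per-job batch submission through Livy's /batches endpoint.
    # Every request here launches a brand-new Spark application, so each short job
    # pays the full context-initialization cost.
    import requests

    LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

    def submit_job(job_args):
        payload = {
            "file": "hdfs:///jobs/my_job.py",   # placeholder job script
            "args": [str(a) for a in job_args],
        }
        resp = requests.post(f"{LIVY_URL}/batches", json=payload)
        resp.raise_for_status()
        return resp.json()["id"]  # batch id, used later to poll for completion

    # Millions of these calls means millions of Spark application startups.
    for i in range(3):
        print("submitted batch", submit_job(["job", i]))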

The thing is, I'm completely new to Spark and have no idea if this is possible or a good idea to follow through on. I tried using a few Livy sessions, but multiple statements cannot be executed simultaneously on a single Livy session, so I'm stuck.

Upvotes: 1

Views: 321

Answers (1)

Yosi Dahari

Reputation: 6999

Each Spark application has its own associated driver, which runs in a dedicated process. Within that process, the SparkContext is a singleton.
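
To see what "singleton" means concretely, here is a tiny PySpark sketch (plain Spark, nothing Livy-specific): within one driver process, getOrCreate() always hands back the existing context rather than building a new one.

    # Inside a single driver process, getOrCreate() returns the existing session,
    # so both builders below share one and the same SparkContext.
    from pyspark.sql import SparkSession

    spark_a = SparkSession.builder.appName("demo").getOrCreate()
    spark_b = SparkSession.builder.getOrCreate()

    print(spark_a.sparkContext is spark_b.sparkContext)  # True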

Depending on what your source is, you can use one of the following options:

  1. Use streaming or Structured Streaming (see the sketch below)
  2. Use the Thrift server (mainly applies to JDBC/ODBC)

So in short, you need to find a way to minimize the number of applications you run.
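
For option 1, a rough sketch of what a single long-lived Structured Streaming application consuming job requests could look like is below. The input path, schema, and per-job logic are invented placeholders; in practice the source would more likely be Kafka, and process_batch would hold your real per-job code:

    # Hedged sketch of option 1: keep one long-lived Spark application and feed it
    # work as a stream, instead of launching a new application per job.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("job-consumer").getOrCreate()

    # Job requests arrive as small JSON files dropped into a directory
    # (placeholder source; Kafka is the more usual choice in production).
    request_schema = StructType([
        StructField("job_id", StringType()),
        StructField("params", StringType()),
    ])

    requests_stream = (
        spark.readStream
             .schema(request_schema)
             .json("hdfs:///incoming/job-requests")  # hypothetical path
    )

    def process_batch(batch_df, batch_id):
        # Each micro-batch can contain many small job requests; they all reuse the
        # already-initialized context, so there is no per-job startup cost.
        for row in batch_df.collect():
            # ... run the actual per-job logic here ...
            print(f"batch {batch_id}: handled job {row.job_id}")

    query = (
        requests_stream.writeStream
                       .foreachBatch(process_batch)
                       .start()
    )
    query.awaitTermination()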

Upvotes: 1
