Reputation: 1542
How do I submit multiple Spark jobs in parallel using Python's joblib library?
I also want to do a "save" or "collect" in every job, so I need to reuse the same SparkContext between the jobs.
Upvotes: 0
Views: 2897
Reputation: 1542
Here is an example of running multiple independent Spark jobs in parallel, without waiting for the first one to finish.
One caveat is that Spark FAIR scheduling must be enabled (spark.scheduler.mode=FAIR), so that jobs submitted from different threads share executor resources instead of being scheduled strictly FIFO.
This solution uses threads instead of separate processes so that the same SparkSession (and its SparkContext) can be reused across the jobs.
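The code below assumes a SparkSession named spark already exists, as it does in a PySpark shell or notebook. If you create it yourself, a minimal sketch that also enables FAIR scheduling could look like this (the app name is just an example):
from pyspark.sql import SparkSession

# Minimal sketch: create (or reuse) a session with FAIR scheduling enabled.
# The app name is arbitrary; adjust master/config settings for your cluster.
spark = (
    SparkSession.builder
    .appName("parallel-joblib-jobs")
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)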
from pyspark.sql.functions import col, mean
from pyspark.sql.types import LongType
from joblib import Parallel, delayed
import pandas as pd
import random

lst = list(range(10, 100))

# Column expression that multiplies each value by a random constant.
def multiply(a):
    return a * random.randint(10, 100)

def foo(i):
    # This is the key point: any Spark action (collect/save/show/toPandas) can run here.
    # Because the action executes inside the job, the independent jobs overlap
    # instead of waiting for each other, which is what parallelizing speeds up.
    return (
        spark.createDataFrame(range(0, i), LongType())
        .select(mean(multiply(col("value"))).alias("value"))
        .toPandas()
    )

parallel_job_count = 10
# Use "threads" to allow the same spark object to be reused between the jobs.
results = Parallel(n_jobs=parallel_job_count, prefer="threads")(delayed(foo)(i) for i in lst)

# Combine and print the results
mean_of_means = pd.concat(results).value.mean()
print(f"Mean of Means: {mean_of_means}")
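Since the question also asks about doing a "save" in every job, the same pattern works with a write instead of toPandas; here is a hedged sketch (the output path is hypothetical, adjust it for your storage):
def save_job(i):
    # Hypothetical output location; point this at HDFS/S3/local storage as needed.
    (
        spark.createDataFrame(range(0, i), LongType())
        .select(mean(multiply(col("value"))).alias("value"))
        .write.mode("overwrite")
        .parquet(f"/tmp/parallel_job_output/{i}")
    )

Parallel(n_jobs=parallel_job_count, prefer="threads")(delayed(save_job)(i) for i in lst)
Threads are a good fit here because the driver mostly waits on the cluster while a job runs, and a SparkSession cannot be pickled and shared across separate Python processes.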
Upvotes: 4