Udo

Reputation: 45

How to run parallel programs with pyspark?

I would like to use our Spark cluster to run programs in parallel. My idea is to do something like the following:

from pyspark.sql import SparkSession

def simulate():
    # some magic happening in here
    return 0

spark = (
    SparkSession.builder
    .appName('my_simulation')
    .enableHiveSupport()
    .getOrCreate())

sc = spark.sparkContext

no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate())
print(res.collect())

The question I have is whether there's a way to execute simulate() with different parameters. The only way I can currently imagine is to have a dataframe specifying the parameters, so something like this:

parameter_list = [[5, 2.3, 3], [3, 0.2, 4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print(res.collect())

Is there another, more elegant way to run parallel functions with Spark?

Upvotes: 0

Views: 4444

Answers (1)

Ryan Widmaier

Reputation: 8523

If the data you are parameterizing your call with differs from row to row, then yes, you will need to include it with each row, much like your second snippet already does.
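A minimal sketch of that per-row approach, assuming a simulate(a, b, c) signature (the signature is just illustrative, adapt it to your actual function):

param_rows = sc.parallelize([[5, 2.3, 3], [3, 0.2, 4]])
# unpack each parameter row into simulate's positional arguments
res = param_rows.map(lambda row: simulate(*row))
print(res.collect())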

However, if you are looking to set global parameters that affect every row, then you can use a broadcast variable.

http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables

Broadcast variables are created once in your script and cannot be modified after that. Spark will efficiently distribute those values to every executor so they are available to your transformations. To create one, you give the data to Spark, and it hands back a handle you can use to access the value on the executors. For example:

settings_bc = sc.broadcast({
    'num_of_donkeys': 3,
    'donkey_color': 'brown'
})

def simulate(settings, n):
    # do magic
    return n

no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print(res.collect())
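If you need both shared settings and per-row parameters, the two approaches combine naturally. A sketch, assuming simulate is adapted to take the broadcast settings plus the values in each row (that signature is an assumption, not something from your code):

def simulate(settings, a, b, c):
    # combine the shared settings with the per-row parameters
    return a * b + c + settings['num_of_donkeys']

param_rows = sc.parallelize([[5, 2.3, 3], [3, 0.2, 4]])
res = param_rows.map(lambda row: simulate(settings_bc.value, *row))
print(res.collect())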

Upvotes: 1
