Udo

Reputation: 45

How to run parallel programs with pyspark?

I would like to use our Spark cluster to run programs in parallel. My idea is to do something like the following:

from pyspark.sql import SparkSession

def simulate():
    # some magic happening in here
    return 0

spark = (
    SparkSession.builder
    .appName('my_simulation')
    .enableHiveSupport()
    .getOrCreate())

sc = spark.sparkContext

no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate())
print(res.collect())

The question I have is whether there's a way to execute simulate() with different parameters. The only way I can currently imagine is to have a dataframe specifying the parameters, so something like this:

parameter_list = [[5, 2.3, 3], [3, 0.2, 4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print(res.collect())

Is there another, more elegant way to run parallel functions with Spark?

Upvotes: 0

Views: 4444

Answers (1)

Ryan Widmaier

Reputation: 8523

If the data you are parameterizing your call with differs from row to row, then yes, you will need to include it with each row, much like your second snippet already does.
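A minimal sketch of that per-row approach, assuming a simulate(a, b, c) signature (the signature is just illustrative, adapt it to your actual function):

param_rows = sc.parallelize([[5, 2.3, 3], [3, 0.2, 4]])
# unpack each parameter row into simulate's positional arguments
res = param_rows.map(lambda row: simulate(*row))
print(res.collect())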

However, if you are looking to set global parameters that affect every row, then you can use a broadcast variable.

http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables

Broadcast variables are created once in your script and cannot be modified after that. Spark will efficiently distribute those values to every executor so they are available to your transformations. To create one, you give the data to Spark, and it hands back a handle you can use to access the value on the executors. For example:

settings_bc = sc.broadcast({
    'num_of_donkeys': 3,
    'donkey_color': 'brown'
})

def simulate(settings, n):
    # do magic
    return n

no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print(res.collect())
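If you need both shared settings and per-row parameters, the two approaches combine naturally. A sketch, assuming simulate is adapted to take the broadcast settings plus the values in each row (that signature is an assumption, not something from your code):

def simulate(settings, a, b, c):
    # combine the shared settings with the per-row parameters
    return a * b + c + settings['num_of_donkeys']

param_rows = sc.parallelize([[5, 2.3, 3], [3, 0.2, 4]])
res = param_rows.map(lambda row: simulate(settings_bc.value, *row))
print(res.collect())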

Upvotes: 1
