Reputation: 7130
I have a compute-intensive Python function that is called repeatedly within a for loop (each iteration is independent, i.e. embarrassingly parallel). I am looking for spark.lapply-style functionality (as in SparkR) to utilize the Spark cluster.
Upvotes: 0
Views: 620
Reputation: 751
Native Spark: If you use Spark data frames and libraries, then Spark will natively parallelize and distribute your task across the cluster.
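A minimal sketch of the native approach, assuming the per-row computation can be expressed with built-in Spark SQL functions (the data frame df and column v here are placeholders, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data frame; in practice this would hold your input rows
df = spark.range(1_000_000).withColumn('v', F.col('id').cast('double'))

# Built-in column expressions are executed in parallel across the executors
result = df.withColumn('v2', F.sqrt(F.col('v')) + 1)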
Thread Pools: One of the ways you can achieve parallelism in Spark without using Spark data frames is by using the multiprocessing library. However, by default all of your code will run on the driver node; only the work submitted as Spark jobs from those threads is distributed to the executors.
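A sketch of that pattern, assuming each loop iteration can be wrapped in a function that triggers its own Spark job (run_iteration and its parameter list are hypothetical):

from multiprocessing.pool import ThreadPool
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical per-iteration task: the Python body runs on the driver,
# but the Spark action it triggers is distributed across the cluster
def run_iteration(n):
    return spark.range(n).count()

# Run up to 4 independent Spark jobs concurrently from driver threads
pool = ThreadPool(4)
results = pool.map(run_iteration, [10, 100, 1000, 10000])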
Pandas UDFs: One of the newer features in Spark that enables parallel processing is Pandas UDFs. With this feature, you can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where your function is applied, and then the results are combined back into one large Spark data frame.
Row-at-a-time UDF example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html:
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf
@udf('double')
# Input/output are both a single double value
def plus_one(v):
    return v + 1

# df is an existing Spark data frame with a double column v
df.withColumn('v2', plus_one(df.v))
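The snippet above is the row-at-a-time baseline from that post; the vectorized Pandas UDF version from the same post looks roughly like this (it requires PyArrow to be installed):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Use pandas_udf to define a vectorized UDF
@pandas_udf('double', PandasUDFType.SCALAR)
# Input/output are both a pandas.Series of doubles
def pandas_plus_one(v):
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))

On newer Spark versions the PandasUDFType argument can be dropped and @pandas_udf('double') alone is enough for a scalar Series-to-Series UDF.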
Upvotes: 1