Reputation: 2561
I have a function below which runs some calculations per customer using fixed DataFrames.
def calculate_fun(customer):
    """
    Instead of loop
    """
    result_output = main_calculation_fun(DataFrame_1[['ColA', 'ColB', 'ColC', 'ColD']], DataFrame_2, customer)
    return pd.DataFrame(result_output)
Currently I run the code like this:
all_result = list(map(calculate_fun, customers['customer_id'].unique().tolist()))
I am using Databricks and can see I have 150+ cores and 400 GB RAM. How can I distribute the customer_ids over the cores? Right now it runs for 2+ days.
I tried the following:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

df = spark.createDataFrame(data = pd.DataFrame(unique_contracts, columns = ['customer_id']))
DataFrame_1 = spark.createDataFrame(data = DataFrame_1)
DataFrame_2 = spark.createDataFrame(data = DataFrame_2)

def reformat(partitionData):
    for row in partitionData:
        df_result = main_calculation_fun(DataFrame_1[['ColA', 'ColB', 'ColC', 'ColD']], DataFrame_2, row.customer_id)
    return pd.DataFrame(df_result)

df2 = df.rdd.mapPartitions(reformat).toDF(["A", "B", "C", "D", "E"])
But this fails with:
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Upvotes: 0
Views: 673
Reputation: 1148
As you are already using pandas DataFrames, it might be easiest to execute your function with applyInPandas.
Here is a sample structure which you have to adapt to your resulting schema. In addition, your input dataframe is expected to be a Spark dataframe, and it should be possible to group it by customer.
from pyspark.sql.types import StructType, StructField, StringType, FloatType

res_schema = StructType([
    StructField('Column 1', FloatType()),
    StructField('Column 2', FloatType()),
    StructField('Column 3', StringType())
])

def calculate_fun(pdf):
    # applyInPandas feeds each group to this function as a pandas DataFrame
    return main_calculation_fun(pdf[['ColA', 'ColB', 'ColC', 'ColD']], DataFrame_2)

all_result = df.groupBy('customer').applyInPandas(calculate_fun, schema=res_schema)
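For completeness, a hedged sketch of how the surrounding pieces might look, assuming the per-customer input columns live in a pandas DataFrame (here DataFrame_1, with a hypothetical 'customer' key column) and that DataFrame_2 stays a plain pandas DataFrame so it can be captured by calculate_fun; a Spark DataFrame cannot be referenced inside a function that runs on the workers, which is exactly what caused your pickling error:

# Input must be a Spark DataFrame containing the grouping key plus the input columns
# ('customer' is an assumed column name here)
df = spark.createDataFrame(DataFrame_1[['customer', 'ColA', 'ColB', 'ColC', 'ColD']])

all_result = df.groupBy('customer').applyInPandas(calculate_fun, schema=res_schema)

# all_result is a lazy Spark DataFrame; the work is only executed on the cluster
# when an action is triggered, e.g. when pulling the result back to pandas:
result_pdf = all_result.toPandas()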
Alternatively, to use the many cores you have, you could parallelise the map call with multiprocessing. I believe it should look something like this:
from multiprocessing import Pool

parallel_processes = 20

with Pool(parallel_processes) as p:
    all_result = p.map(calculate_fun, customers['customer_id'].unique().tolist())
This will spawn 20 parallel processes. You can probably tune this number upward, depending on the resources a single iteration needs.
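Since calculate_fun already returns a pandas DataFrame per customer, the list of results can then be combined in the usual way (a small usage sketch, assuming the per-customer frames share the same columns):

import pandas as pd

# Stack the per-customer result frames into one DataFrame
final_result = pd.concat(all_result, ignore_index=True)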
Careful: In this case, all the processes are executed on the driver node. If your cores are distributed over multiple physical machines, you won't be able to benefit from all of them.
It is also worth looking into whether the calculate_fun function can be rewritten as a UDF, or even fully in PySpark. This would help optimize the execution even further.
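For illustration only, a minimal sketch of what a vectorised pandas UDF could look like, assuming part of main_calculation_fun can be expressed as column-wise operations (score_udf and the arithmetic here are hypothetical placeholders):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

@pandas_udf(FloatType())
def score_udf(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    # hypothetical vectorised piece of the per-customer calculation
    return col_a * col_b

# df.withColumn('score', score_udf('ColA', 'ColB')) would then run distributed over the cluster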
Upvotes: 1