Reputation: 1587
I'm familiar with using UDFs to apply a custom function row-by-row to a DataFrame. However, I would like to know how to apply a custom function to different subsets of my DataFrame in parallel.
Here's a simplified example:
import numpy as np
import pandas as pd
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})
My custom function takes an array of numbers as an input. For each unique 'id', I want to apply my function to the array of 'val' values associated with that id.
The simplistic way I'm doing it right now is to loop over my PySpark DataFrame: for each 'id' I filter the data, convert it to a pandas DataFrame, and apply the function. It works, but it's obviously slow and makes no use of Spark's parallelism.
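Roughly, my current loop looks like the sketch below (my_func is a placeholder for my real function, and spark_df for my actual Spark DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(dummy_data)  # the dummy data from above

def my_func(vals):
    # placeholder for the real custom function that takes an array of numbers
    return vals.mean()

results = {}
for row in spark_df.select('id').distinct().collect():
    # pull all 'val' values for this id back to the driver as pandas
    pdf = spark_df.filter(spark_df['id'] == row['id']).toPandas()
    results[row['id']] = my_func(pdf['val'].values)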
How can I parallelise this?
Upvotes: 4
Views: 1186
Reputation: 611
This answer is short enough that it should really be a comment, but I don't have enough reputation to comment.
Spark 2.3 introduced pandas vectorized UDFs, which are exactly what you're looking for: they let you execute a custom pandas transformation over each group of a Spark DataFrame, in a distributed fashion and with great performance thanks to PyArrow serialization.
See the Spark documentation on pandas UDFs for more information and examples.
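A minimal sketch with the question's dummy data, using a GROUPED_MAP pandas UDF (here the per-group transformation just demeans 'val' as a stand-in for your real custom function):

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# The question's dummy data, loaded into a Spark DataFrame
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})
df = spark.createDataFrame(dummy_data)

# Schema of the DataFrame returned by the grouped-map UDF
schema = StructType([StructField('id', StringType()),
                     StructField('val_demeaned', DoubleType())])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def demean(group):
    # 'group' is a pandas DataFrame holding every row for one 'id';
    # any pandas/NumPy logic can be applied to group['val'] here.
    return pd.DataFrame({'id': group['id'],
                         'val_demeaned': group['val'] - group['val'].mean()})

df.groupby('id').apply(demean).show()

Each group is handed to the UDF as a complete pandas DataFrame on an executor, so your function sees the full 'val' array for its id while the groups are processed in parallel across the cluster.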
Upvotes: 2