Reputation: 1587
I'm familiar with using UDFs to apply a custom function row-by-row to a DataFrame. However, I would like to know how to apply a custom function to different subsets of my DataFrame in parallel.
Here's a simplified example:
import numpy as np
import pandas as pd
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})
My custom function takes an array of numbers as an input. For each unique 'id', I want to apply my function to the array of 'val' values associated with that id.
The simplistic way I'm doing it right now is to loop over my PySpark DataFrame: for each 'id' I filter the data, convert it to a pandas DataFrame, and apply the function. It works, but it's obviously slow and makes no use of Spark's parallelism.
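Roughly, my current loop looks like the sketch below (my_func is a placeholder for my real function, and spark_df for my actual Spark DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(dummy_data)  # the dummy data from above

def my_func(vals):
    # placeholder for the real custom function that takes an array of numbers
    return vals.mean()

results = {}
for row in spark_df.select('id').distinct().collect():
    # pull all 'val' values for this id back to the driver as pandas
    pdf = spark_df.filter(spark_df['id'] == row['id']).toPandas()
    results[row['id']] = my_func(pdf['val'].values)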
How can I parallelise this?
Upvotes: 4
Views: 1186
Reputation: 611
This answer is short enough that it should really be a comment, but I don't have enough reputation to comment.
Spark 2.3 introduced pandas vectorized UDFs, which are exactly what you're looking for: they let you execute a custom pandas transformation over each group of a Spark DataFrame, in a distributed fashion and with great performance thanks to PyArrow serialization.
See the Spark documentation on pandas UDFs for more information and examples.
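A minimal sketch with the question's dummy data, using a GROUPED_MAP pandas UDF (here the per-group transformation just demeans 'val' as a stand-in for your real custom function):

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# The question's dummy data, loaded into a Spark DataFrame
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})
df = spark.createDataFrame(dummy_data)

# Schema of the DataFrame returned by the grouped-map UDF
schema = StructType([StructField('id', StringType()),
                     StructField('val_demeaned', DoubleType())])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def demean(group):
    # 'group' is a pandas DataFrame holding every row for one 'id';
    # any pandas/NumPy logic can be applied to group['val'] here.
    return pd.DataFrame({'id': group['id'],
                         'val_demeaned': group['val'] - group['val'].mean()})

df.groupby('id').apply(demean).show()

Each group is handed to the UDF as a complete pandas DataFrame on an executor, so your function sees the full 'val' array for its id while the groups are processed in parallel across the cluster.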
Upvotes: 2