Reputation: 11
I would like to use a pandas UDF to speed up a user-defined function. The type of pandas UDF I am interested in is the one that takes a pandas DataFrame as input and returns a pandas DataFrame (PandasUDFType.GROUPED_MAP).
However, it seems that these pandas UDFs must be used inside a groupby().apply() construct, while in my case I would simply like to apply the pandas UDF to every partition of the PySpark DataFrame, the idea being to transform each partition into a local pandas DataFrame on each executor. In fact, I would like to avoid any kind of groupby, because it would cause data reshuffling.
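To make the intent concrete, here is a minimal pure-pandas sketch of the kind of function I want to run on each partition (`transform_partition` and its logic are made up for illustration):

```python
import pandas as pd

# Grouped-map contract: one pandas DataFrame in (here standing in for
# one partition's rows), one pandas DataFrame out.
def transform_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # hypothetical per-partition transformation
    return pdf.assign(row_count=len(pdf))

# Illustration on a local slice of data:
local_slice = pd.DataFrame({"x": [1, 2, 3]})
result = transform_partition(local_slice)
```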
Is there a way to achieve this, perhaps by specifying that the groupby should be done by partition, or something similar?
Upvotes: 1
Views: 1000
Reputation: 21
The groupby clause is effectively how you tell Spark to "slice" your PySpark DataFrame: each slice is converted to a pandas DataFrame and processed by your pandas UDF. It is analogous to a for loop iterating over the distinct values of the field (or fields) passed to the groupby clause.
For example, I created an FC PySpark DataFrame and a pandas UDF containing a hierarchical forecasting model at the L4 level, returning the predicted values at the end. The FC DataFrame also contains columns L1-L3 of the company product hierarchy above L4, where L1 is Total. So I can pass L1 to the groupby clause and one model will run, or I can pass L2 and a few models will run in parallel (on a multi-worker cluster). Or I can pass the combination of L1 + L2 if I am not sure whether the hierarchy is crossed in some cases, as below; but the output will always be a forecast at L4, as defined in my pandas UDF (no aggregation of values happens before processing).
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

result_schema = StructType([
    StructField("xy", StringType(), True),
    #...define the remaining output fields
])

# grouped-map function: one pandas DataFrame in, one pandas DataFrame out
# (named forecast_udf to avoid shadowing pyspark.sql.functions.pandas_udf)
def forecast_udf(df_pandas: pd.DataFrame) -> pd.DataFrame:
    #..use df_pandas to fit the model and build df_forecast
    return df_forecast

result_spark_df = (
    input_spark_df
    .groupBy(['L1', 'L2'])
    .applyInPandas(forecast_udf, schema=result_schema)
    .withColumn('run_time', F.current_timestamp())  # timestamp of this run
)
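The "for loop" analogy above can be sketched in plain pandas (my sketch, not Spark itself; `forecast_slice` stands in for the real forecasting model):

```python
import pandas as pd

# Toy stand-in for the FC DataFrame: L1 is Total, L2 splits into two groups.
fc = pd.DataFrame({"L1": ["Total"] * 4,
                   "L2": ["A", "A", "B", "B"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

def forecast_slice(pdf: pd.DataFrame) -> pd.DataFrame:
    # stand-in for the real model: annotate each slice with its mean
    return pdf.assign(mean_value=pdf["value"].mean())

# groupBy(['L1']) -> one slice, so one "model run"
one_run = [forecast_slice(g) for _, g in fc.groupby(["L1"])]

# groupBy(['L1','L2']) -> one slice per (L1, L2) combination;
# on a multi-worker cluster these runs execute in parallel
many_runs = [forecast_slice(g) for _, g in fc.groupby(["L1", "L2"])]
```

Each slice is handed to the function whole, which matches the point above: no aggregation of values happens before your UDF sees the data.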
Upvotes: 0