slysid

Reputation: 5498

pyspark - aggregation

Say I have a dataframe as below:

mid | bid | m_date1    | m_date2 | m_date3
100 | ws  |            |         | 2022-02-01
200 | gs  | 2022-02-01 |         |
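
For reference, this sample dataframe can be built with something like the snippet below (an explicit schema is needed because m_date2 is null in every row; the dates are kept as strings for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# m_date2 has no non-null values, so the schema cannot be inferred
df = spark.createDataFrame(
    [(100, 'ws', None, None, '2022-02-01'),
     (200, 'gs', '2022-02-01', None, None)],
    'mid int, bid string, m_date1 string, m_date2 string, m_date3 string',
)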

Now I have an SQL aggregation as below:

SELECT
    mid,
    bid,
    min(next_day(m_date1, 'SAT')) AS dat1,
    min(next_day(m_date2, 'SAT')) AS dat2,
    min(next_day(m_date3, 'SAT')) AS dat3
FROM df
GROUP BY 1, 2

I would like to implement the above aggregation in PySpark, and I am wondering whether I can use some form of iteration to produce dat1, dat2 and dat3, since the same min function is applied to each of those columns. I could use the aggregation syntax below for each column, but I want to avoid repeating the min function for every aggregated column.

df.groupBy('mid','bid').agg(...)
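
i.e. something along these lines for each column (using pyspark.sql.functions as F; next_day is Spark's built-in equivalent of the NEXT step above):

from pyspark.sql import functions as F

# one explicit min(next_day(...)) per date column
df.groupBy('mid', 'bid').agg(
    F.min(F.next_day('m_date1', 'SAT')).alias('dat1'),
    F.min(F.next_day('m_date2', 'SAT')).alias('dat2'),
    F.min(F.next_day('m_date3', 'SAT')).alias('dat3'),
)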

Thank you

Upvotes: 1

Views: 232

Answers (1)

wwnde

Reputation: 26676

A sample of the expected output would have been helpful. If I understood you correctly, you are after:

from pyspark.sql import functions as F

df.groupBy('mid','bid').agg(*[F.min(c).alias(f"min{c}") for c in df.drop('mid','bid').columns]).show()
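
If you also want the next-Saturday step from your SQL, the same comprehension can wrap each date column in next_day first, e.g. (a sketch, assuming every non-key column is a date):

from pyspark.sql import functions as F

date_cols = df.drop('mid', 'bid').columns

# apply min(next_day(col, 'SAT')) to every date column in one pass
df.groupBy('mid', 'bid').agg(
    *[F.min(F.next_day(c, 'SAT')).alias(f"dat{n}")
      for n, c in enumerate(date_cols, start=1)]
).show()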

Upvotes: 1
