Anubhav Jain

Reputation: 351

How to groupBy on one column and take the minimum of another column?

Input DataFrame

+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
|   A| AA| qwe|       2|
|   B| AA| qwe|       1|
|   A| BB| qwe|       2|
|   B| BB| qwe|       1|
|   A| AA| jku|       2|
|   B| AA| jku|       1|
|   A| BB| jku|       2|
|   B| BB| jku|       1|
+----+---+----+--------+

Output DataFrame

+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
|   B| AA| qwe|       1|
|   B| AA| jku|       1|
|   B| BB| qwe|       1|
|   B| BB| jku|       1|
+----+---+----+--------+

Problem description, step by step:

  1. First group by the "aud" column.

  2. Then, within every group, find the minimum value in the "priority" column for each "code" value.

  3. For example, in group AA the minimum "priority" for code "qwe" is 1, so keep that row and filter out the others.

I am new to PySpark, so can anyone tell me how to do this using window functions or any other optimized way? Thanks for reading.
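For reference, here is one way to construct the input DataFrame so the example is reproducible (a minimal sketch assuming a local SparkSession; `df` is the variable name the answer's code uses):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the input DataFrame shown above
df = spark.createDataFrame(
    [
        ("A", "AA", "qwe", 2), ("B", "AA", "qwe", 1),
        ("A", "BB", "qwe", 2), ("B", "BB", "qwe", 1),
        ("A", "AA", "jku", 2), ("B", "AA", "jku", 1),
        ("A", "BB", "jku", 2), ("B", "BB", "jku", 1),
    ],
    ["type", "aud", "code", "priority"],
)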

Upvotes: 0

Views: 123

Answers (1)

Aleh Pranovich

Reputation: 361

If you need to group by "aud" and then find the minimum "priority" for each "code", you can partition by both the "aud" and "code" columns and keep the row with the minimum "priority" in each partition:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank the rows in each ("aud", "code") partition by ascending priority
window_spec = Window \
    .partitionBy("aud", "code") \
    .orderBy("priority")

# Keep only the first-ranked row (minimum priority), then drop the helper column
res = df \
    .withColumn("row_num", row_number().over(window_spec)) \
    .where(col("row_num") == 1) \
    .drop("row_num")

res.show()

Output:

+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
|   B| AA| jku|       1|
|   B| AA| qwe|       1|
|   B| BB| qwe|       1|
|   B| BB| jku|       1|
+----+---+----+--------+
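If you prefer to avoid window functions, an equivalent approach (a sketch, not part of the original answer) is to aggregate the minimum "priority" per ("aud", "code") pair and join the result back onto the original DataFrame:

from pyspark.sql.functions import min as min_

# Minimum priority per ("aud", "code") pair
min_df = df.groupBy("aud", "code").agg(min_("priority").alias("priority"))

# Inner join keeps only the rows whose priority equals the group minimum
res2 = df.join(min_df, on=["aud", "code", "priority"])
res2.show()

One difference to be aware of: unlike row_number(), the join keeps all rows tied for the minimum priority, which may or may not be what you want.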

Upvotes: 2
