Anubhav Jain

Reputation: 351

How to groupBy on one column and take the minimum of another column?

Input DataFrame

+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
|   A| AA| qwe|       2|
|   B| AA| qwe|       1|
|   A| BB| qwe|       2|
|   B| BB| qwe|       1|
|   A| AA| jku|       2|
|   B| AA| jku|       1|
|   A| BB| jku|       2|
|   B| BB| jku|       1|
+----+---+----+--------+

Output DataFrame

+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
|   B| AA| qwe|       1|
|   B| AA| jku|       1|
|   B| BB| qwe|       1|
|   B| BB| jku|       1|
+----+---+----+--------+

Problem description, step by step:

  1. First group by the "aud" column.

  2. Then, within every group, find the minimum value in the "priority" column for each "code" value.

  3. For example, in group AA the minimum "priority" for code "qwe" is 1, so keep that row and filter out the others.

I am new to PySpark, so can anyone tell me how to do this using window functions or any other optimized way? Thanks for reading.
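For reference, here is one way to construct the input DataFrame so the example is reproducible (a minimal sketch assuming a local SparkSession; `df` is the variable name the answer's code uses):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the input DataFrame shown above
df = spark.createDataFrame(
    [
        ("A", "AA", "qwe", 2), ("B", "AA", "qwe", 1),
        ("A", "BB", "qwe", 2), ("B", "BB", "qwe", 1),
        ("A", "AA", "jku", 2), ("B", "AA", "jku", 1),
        ("A", "BB", "jku", 2), ("B", "BB", "jku", 1),
    ],
    ["type", "aud", "code", "priority"],
)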

Upvotes: 0

Views: 123

Answers (1)

Aleh Pranovich

Reputation: 361

If you need to group by "aud" and then find the minimum "priority" for each "code", you can partition by both the "aud" and "code" columns and keep the row with the minimum "priority" in each partition:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank the rows in each ("aud", "code") partition by ascending priority
window_spec = Window \
    .partitionBy("aud", "code") \
    .orderBy("priority")

# Keep only the first-ranked row (minimum priority), then drop the helper column
res = df \
    .withColumn("row_num", row_number().over(window_spec)) \
    .where(col("row_num") == 1) \
    .drop("row_num")

res.show()

Output:

+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
|   B| AA| jku|       1|
|   B| AA| qwe|       1|
|   B| BB| qwe|       1|
|   B| BB| jku|       1|
+----+---+----+--------+
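If you prefer to avoid window functions, an equivalent approach (a sketch, not part of the original answer) is to aggregate the minimum "priority" per ("aud", "code") pair and join the result back onto the original DataFrame:

from pyspark.sql.functions import min as min_

# Minimum priority per ("aud", "code") pair
min_df = df.groupBy("aud", "code").agg(min_("priority").alias("priority"))

# Inner join keeps only the rows whose priority equals the group minimum
res2 = df.join(min_df, on=["aud", "code", "priority"])
res2.show()

One difference to be aware of: unlike row_number(), the join keeps all rows tied for the minimum priority, which may or may not be what you want.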

Upvotes: 2
