Reputation: 351
Input DataFrame
+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
| A| AA| qwe| 2|
| B| AA| qwe| 1|
| A| BB| qwe| 2|
| B| BB| qwe| 1|
| A| AA| jku| 2|
| B| AA| jku| 1|
| A| BB| jku| 2|
| B| BB| jku| 1|
+----+---+----+--------+
Output DataFrame
+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
| B| AA| qwe| 1|
| B| AA| jku| 1|
| B| BB| qwe| 1|
| B| BB| jku| 1|
+----+---+----+--------+
Problem description steps:
First, group by the "aud" column.
Then, within every group, find the minimum "priority" value for each "code" value.
For example, in group AA the minimum "priority" for code "qwe" is 1, so keep that row and filter out the others.
I am new to PySpark, so can anyone tell me how to do this using window functions or any other optimized way? Thanks for reading.
Upvotes: 0
Views: 123
Reputation: 361
If you need to group by "aud" and then find the minimum "priority" per "code", you can partition by both the "aud" and "code" columns and keep the row with the smallest "priority" in each partition.
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each ("aud", "code") partition by ascending "priority"
window_spec = Window \
    .partitionBy("aud", "code") \
    .orderBy("priority")

# Keep only the top-ranked row (minimum priority) in each partition
res = df \
    .withColumn("row_num", row_number().over(window_spec)) \
    .where(col("row_num") == 1) \
    .drop("row_num")
res.show()
Output:
+----+---+----+--------+
|type|aud|code|priority|
+----+---+----+--------+
| B| AA| jku| 1|
| B| AA| qwe| 1|
| B| BB| qwe| 1|
| B| BB| jku| 1|
+----+---+----+--------+
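Note that row_number() keeps exactly one row per partition even if several rows tie for the minimum priority. If you wanted to keep all tied rows instead, here is a minimal sketch (assuming the same df) that uses min() as a window aggregate rather than ordering:

from pyspark.sql import Window
from pyspark.sql.functions import col, min as min_

# Compute the minimum "priority" per ("aud", "code") partition; no ordering needed
window_spec = Window.partitionBy("aud", "code")

# Keep every row whose priority equals the partition minimum (ties survive)
res = df \
    .withColumn("min_priority", min_("priority").over(window_spec)) \
    .where(col("priority") == col("min_priority")) \
    .drop("min_priority")
res.show()

For your sample data both approaches produce the same result, since each ("aud", "code") pair has a unique minimum.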
Upvotes: 2