Reputation: 65
Consider that I have this dataframe in pyspark:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-03-2014 07:33| 4.5| 90041 |
| 00236|11-04-2014 05:43| 7.2| 90024 |
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-06-2014 05:55| 5.6| 90037 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
| 00235|11-02-2014 05:33| 4.3| 90029 |
| 00235|11-09-2014 05:33| 4.2| 90047 |
+--------+----------------+---------+---------+
How can I write a PySpark script that keeps, for each DeviceID, the row with the maximum value in the range column along with the other columns of this dataframe? The output should look like this:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
+--------+----------------+---------+---------+
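For reference, a minimal sketch to reproduce the example dataframe, assuming an active SparkSession named spark (the values are taken from the table above):
data = [
    ("00236", "11-03-2014 07:33", 4.5, "90041"),
    ("00236", "11-04-2014 05:43", 7.2, "90024"),
    ("00236", "11-05-2014 05:43", 8.5, "90026"),
    ("00234", "11-06-2014 05:55", 5.6, "90037"),
    ("00234", "11-01-2014 05:55", 9.2, "90032"),
    ("00235", "11-05-2014 05:33", 4.3, "90082"),
    ("00235", "11-02-2014 05:33", 4.3, "90029"),
    ("00235", "11-09-2014 05:33", 4.2, "90047"),
]
df = spark.createDataFrame(data, ["DeviceID", "TimeStamp", "range", "zipcode"])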
Upvotes: 0
Views: 199
Reputation: 3419
Using Window and row_number():
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc, col

# Partition by DeviceID and order each partition by range (descending),
# so the row with the largest range gets row_number 1.
w = Window.partitionBy("DeviceID").orderBy(desc("range"))
df.withColumn("rank", row_number().over(w))\
  .filter(col("rank") == 1)\
  .drop("rank").show()
Output:
+--------+----------------+-----+-------+
|DeviceID| TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
| 00236|11-05-2014 05:43| 8.5| 90026|
| 00234|11-01-2014 05:55| 9.2| 90032|
| 00235|11-05-2014 05:33| 4.3| 90082|
+--------+----------------+-----+-------+
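As a side note, row_number() keeps exactly one row per device even when the maximum range is tied (DeviceID 00235 has two rows with range 4.3, and only one appears above). An alternative sketch without a window is to aggregate the maximum range per DeviceID and join it back to recover the other columns; unlike row_number(), this keeps every row that ties for the maximum:
from pyspark.sql import functions as F

# Maximum range per device, then an inner join back to the original dataframe
# to pull in TimeStamp and zipcode for the matching rows.
max_per_device = df.groupBy("DeviceID").agg(F.max("range").alias("range"))
df.join(max_per_device, on=["DeviceID", "range"], how="inner").show()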
Upvotes: 1