Neuronix

Reputation: 65

How to keep the maximum value of a column along with other columns in a pyspark dataframe?

Consider that I have this dataframe in pyspark:

+--------+----------------+---------+---------+
|DeviceID| TimeStamp      |range    | zipcode |
+--------+----------------+---------+---------+
|   00236|11-03-2014 07:33|      4.5| 90041   |
|   00236|11-04-2014 05:43|      7.2| 90024   |
|   00236|11-05-2014 05:43|      8.5| 90026   |
|   00234|11-06-2014 05:55|      5.6| 90037   |
|   00234|11-01-2014 05:55|      9.2| 90032   |
|   00235|11-05-2014 05:33|      4.3| 90082   |
|   00235|11-02-2014 05:33|      4.3| 90029   |
|   00235|11-09-2014 05:33|      4.2| 90047   |
+--------+----------------+---------+---------+

How can I write a PySpark script that keeps, for each DeviceID, the row with the maximum value of the range column along with the other columns? The output should look like this:

+--------+----------------+---------+---------+
|DeviceID| TimeStamp      |range    | zipcode |
+--------+----------------+---------+---------+
|   00236|11-05-2014 05:43|      8.5| 90026   |
|   00234|11-01-2014 05:55|      9.2| 90032   |
|   00235|11-05-2014 05:33|      4.3| 90082   |
+--------+----------------+---------+---------+

Upvotes: 0

Views: 199

Answers (1)

Cena

Reputation: 3419

Using Window and row_number():

from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window

# Rank rows within each DeviceID by descending range, then keep the top row
w = Window.partitionBy("DeviceID")

df.withColumn("rank", row_number().over(w.orderBy(desc("range")))) \
    .filter(col("rank") == 1) \
    .drop("rank").show()

Output:

+--------+----------------+-----+-------+
|DeviceID|       TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
|   00236|11-05-2014 05:43|  8.5|  90026|
|   00234|11-01-2014 05:55|  9.2|  90032|
|   00235|11-05-2014 05:33|  4.3|  90082|
+--------+----------------+-----+-------+
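For reference, a minimal end-to-end sketch (recreating the sample data from the question; the SparkSession setup is assumed) that reproduces this result:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Sample rows taken from the question
    data = [
        ("00236", "11-03-2014 07:33", 4.5, "90041"),
        ("00236", "11-04-2014 05:43", 7.2, "90024"),
        ("00236", "11-05-2014 05:43", 8.5, "90026"),
        ("00234", "11-06-2014 05:55", 5.6, "90037"),
        ("00234", "11-01-2014 05:55", 9.2, "90032"),
        ("00235", "11-05-2014 05:33", 4.3, "90082"),
        ("00235", "11-02-2014 05:33", 4.3, "90029"),
        ("00235", "11-09-2014 05:33", 4.2, "90047"),
    ]
    df = spark.createDataFrame(data, ["DeviceID", "TimeStamp", "range", "zipcode"])

    # Number rows within each DeviceID by descending range and keep row 1
    w = Window.partitionBy("DeviceID").orderBy(desc("range"))
    df.withColumn("rank", row_number().over(w)) \
        .filter(col("rank") == 1) \
        .drop("rank") \
        .show()

Note that row_number() breaks ties arbitrarily (DeviceID 00235 has two rows with range 4.3), so exactly one row per device is kept; use rank() instead if you want to keep all tied rows.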

Upvotes: 1
