Reputation: 1568
How do I take the max of df.value per id and per day, put it in a max_value column, and keep it only on the first row of each group (dropping the duplicate values of df.max_value so the remaining rows get Null)?
+---+-------------------+-----+----------+
| id| date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| Null|2019-10-01|
| J6|2019-10-01 01:00:00|    1|2019-10-01|
| J6|2019-10-01 12:30:30|    3|2019-10-01|
| J6|2019-10-01 12:30:30|    3|2019-10-01|
| J2|2019-10-06 00:00:00|    9|2019-10-06|
| J2|2019-10-06 09:20:00|    9|2019-10-06|
| J2|2019-10-06 09:20:00|    1|2019-10-06|
| J2|2019-10-06 09:20:00|    9|2019-10-06|
+---+-------------------+-----+----------+
Desired Dataframe:
+---+-------------------+-----+----------+---------+
| id| date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| Null|2019-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2019-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     Null|
| J2|2019-10-06 00:00:00|    9|2019-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     Null|
| J2|2019-10-06 09:20:00|    1|2019-10-06|     Null|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     Null|
+---+-------------------+-----+----------+---------+
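For reference, a minimal sketch that reproduces the sample data above (the spark session name, the string-typed input, and the explicit schema are my assumptions):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Build the sample frame; `date` starts as a string and is cast afterwards.
df = spark.createDataFrame(
    [
        ("J6", "2019-10-01 00:00:00", None),
        ("J6", "2019-10-01 01:00:00", 1),
        ("J6", "2019-10-01 12:30:30", 3),
        ("J6", "2019-10-01 12:30:30", 3),
        ("J2", "2019-10-06 00:00:00", 9),
        ("J2", "2019-10-06 09:20:00", 9),
        ("J2", "2019-10-06 09:20:00", 1),
        ("J2", "2019-10-06 09:20:00", 9),
    ],
    "id string, date string, value long",
)

df = (
    df.withColumn("date", F.to_timestamp("date"))
      .withColumn("date_only", F.to_date("date"))
)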
Upvotes: 1
Views: 1590
Reputation: 3419
Using a combination of max() and row_number():
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordered window to find the first row of each (id, date_only) group.
w = Window.partitionBy("id", "date_only").orderBy("date")
# Unordered window over the same partition, used for the per-day max.
w_max = Window.partitionBy("id", "date_only")

df.withColumn(
    "max_value",
    F.when(F.row_number().over(w) == 1, F.max("value").over(w_max)),
).show()
+---+-------------------+-----+----------+---------+
| id| date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2019-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2019-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
+---+-------------------+-----+----------+---------+
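One design note: the row_number() window orders by date, so the max lands on the earliest timestamp of each (id, date_only) group; if several rows tie on that earliest timestamp, only one of them (arbitrarily) gets the value. A rough equivalent without row_number(), as a sketch under the same assumptions (not the answer's exact method), aggregates once and joins back:
daily = df.groupBy("id", "date_only").agg(
    F.max("value").alias("day_max"),
    F.min("date").alias("first_date"),
)

# Keep the day max only on the earliest timestamp of each group.
# Unlike row_number(), this tags every row tied on that timestamp.
result = (
    df.join(daily, ["id", "date_only"], "left")
      .withColumn(
          "max_value",
          F.when(F.col("date") == F.col("first_date"), F.col("day_max")),
      )
      .drop("day_max", "first_date")
)
result.show()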
Upvotes: 1