Starbucks

Reputation: 1568

PySpark - Drop duplicates of group and keep first row

How do I take the max of df.value within each group and day, put it in a max_value column on the first row of the group only, and leave max_value null for the remaining (duplicate) rows?

+---+-------------------+-----+----------+
| id|               date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| null|2019-10-01|
| J6|2019-10-01 01:00:00|    1|2019-10-01|
| J6|2019-10-01 12:30:30|    3|2019-10-01|
| J6|2019-10-01 12:30:30|    3|2019-10-01|
| J2|2019-10-06 00:00:00|    9|2019-10-06|
| J2|2019-10-06 09:20:00|    9|2019-10-06|
| J2|2019-10-06 09:20:00|    1|2019-10-06|
| J2|2019-10-06 09:20:00|    9|2019-10-06|
+---+-------------------+-----+----------+
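
For reference, a minimal sketch to reproduce this input DataFrame (the spark session and the timestamp/date derivation are assumptions, not part of the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # assumed session

rows = [
    ("J6", "2019-10-01 00:00:00", None),
    ("J6", "2019-10-01 01:00:00", 1),
    ("J6", "2019-10-01 12:30:30", 3),
    ("J6", "2019-10-01 12:30:30", 3),
    ("J2", "2019-10-06 00:00:00", 9),
    ("J2", "2019-10-06 09:20:00", 9),
    ("J2", "2019-10-06 09:20:00", 1),
    ("J2", "2019-10-06 09:20:00", 9),
]

df = (
    spark.createDataFrame(rows, ["id", "date", "value"])
    .withColumn("date", F.to_timestamp("date"))    # string -> timestamp
    .withColumn("date_only", F.to_date("date"))    # calendar day used for grouping
)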

Desired Dataframe:

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2019-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2019-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
+---+-------------------+-----+----------+---------+

Upvotes: 1

Views: 1590

Answers (1)

Cena

Reputation: 3419

You can do this with a combination of max() and row_number() over two windows: an ordered window to identify the first row of each group and day, and an unordered window to compute the max over the whole group:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordered window: row_number() == 1 picks the first row per (id, date_only).
w = Window.partitionBy("id", "date_only").orderBy("date")

# Unordered window over the same groups, so max() spans the whole partition.
w_max = Window.partitionBy("id", "date_only")

# when() without otherwise() leaves every other row null.
df.withColumn(
    "max_value",
    F.when(F.row_number().over(w) == 1, F.max("value").over(w_max)),
).show()

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2019-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2019-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
+---+-------------------+-----+----------+---------+
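
If the end goal (as in the title) is to actually drop the duplicate rows and keep only the first row per group, a hedged follow-up on the same result, reusing w and w_max from above:

res = df.withColumn(
    "max_value",
    F.when(F.row_number().over(w) == 1, F.max("value").over(w_max)),
)

# Only the first row of each (id, date_only) group carries a non-null
# max_value, so filtering on it keeps exactly one row per group.
res.filter(F.col("max_value").isNotNull()).show()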

Upvotes: 1
