wesley

Reputation: 13

pyspark window min(date) and max(date) of group

I have a dataset:

user date loc
a 2021-01-01 1
a 2021-01-02 1
a 2021-01-03 2
a 2021-01-04 2
a 2021-01-05 1
a 2021-01-06 1

I am trying to end up with:

user startdate enddate loc
a 2021-01-01 2021-01-02 1
a 2021-01-03 2021-01-04 2
a 2021-01-05 2021-01-06 1

I have tried a window partitioned over [user, loc] and ordered by date with row_number, and I tried lag to check whether the previous row's loc equals the current loc and then increment a counter, but I am stuck. Does anyone have a solution to this grouping issue? Not sure why it is so hard for me to figure out.
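Roughly, the lag attempt looked like this (a sketch, using the column names from the sample above):

from pyspark.sql import functions as F, Window

w = Window.partitionBy('user').orderBy('date')

df = (df
      .withColumn('prev_loc', F.lag('loc').over(w))
      .withColumn('changed', F.col('loc') != F.col('prev_loc')))
# stuck here: 'changed' flags the boundaries between runs, but I
# could not turn those flags into a group id to aggregate on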

Thanks for any help.

Upvotes: 1

Views: 2633

Answers (1)

Shubham Sharma

Reputation: 71689

Here is one way to approach the problem:

  • Create a helper group column to distinguish consecutive runs of loc per user
  • Then group the dataframe by the columns user, loc and group, and aggregate the column date using min and max
  • Drop the group column and sort the dataframe by startdate
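To make the snippet below self-contained, here is one way to build the sample DataFrame from the question (the SparkSession bootstrap is an assumption; keeping date as a string works here because ISO dates sort lexicographically):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('a', '2021-01-01', 1),
     ('a', '2021-01-02', 1),
     ('a', '2021-01-03', 2),
     ('a', '2021-01-04', 2),
     ('a', '2021-01-05', 1),
     ('a', '2021-01-06', 1)],
    ['user', 'date', 'loc'],
)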
from pyspark.sql import functions as F, Window

# window over each user's rows in date order
w = Window.partitionBy('user').orderBy('date')

# True on rows where loc differs from the previous row's loc
b = F.lag('loc').over(w) != F.col('loc')

(
    df.withColumn('group', b.cast('int'))             # change flag as 0/1
      .fillna(0, 'group')                             # first row per user: lag is null
      .withColumn('group', F.sum('group').over(w))    # running sum -> group id
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'), 
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
      .show()
)

+----+---+----------+----------+
|user|loc| startdate|   enddate|
+----+---+----------+----------+
|   a|  1|2021-01-01|2021-01-02|
|   a|  2|2021-01-03|2021-01-04|
|   a|  1|2021-01-05|2021-01-06|
+----+---+----------+----------+
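The same helper group can also be derived with the classic difference-of-row-numbers trick: within one consecutive run, the row's rank per user and its rank per (user, loc) grow in lockstep, so their difference is constant per run. A sketch that gives the same result on this data:

w_user = Window.partitionBy('user').orderBy('date')
w_user_loc = Window.partitionBy('user', 'loc').orderBy('date')

# constant within a consecutive run, different across runs
grp = F.row_number().over(w_user) - F.row_number().over(w_user_loc)

(
    df.withColumn('group', grp)
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'),
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
      .show()
)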

Upvotes: 2
