wesley

Reputation: 13

pyspark window min(date) and max(date) of group

I have a dataset:

user date loc
a 2021-01-01 1
a 2021-01-02 1
a 2021-01-03 2
a 2021-01-04 2
a 2021-01-05 1
a 2021-01-06 1

I am trying to end up with:

user startdate enddate loc
a 2021-01-01 2021-01-02 1
a 2021-01-03 2021-01-04 2
a 2021-01-05 2021-01-06 1

I have tried a window partitioned over [user, loc] and ordered by date with row_number, and I tried lag to check whether the previous row's loc equals the current loc and then increment a counter, but I am stuck. Does anyone have a solution to this grouping issue? Not sure why it is so hard for me to figure out.
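Roughly, the lag attempt looked like this (a sketch, using the column names from the sample above):

from pyspark.sql import functions as F, Window

w = Window.partitionBy('user').orderBy('date')

df = (df
      .withColumn('prev_loc', F.lag('loc').over(w))
      .withColumn('changed', F.col('loc') != F.col('prev_loc')))
# stuck here: 'changed' flags the boundaries between runs, but I
# could not turn those flags into a group id to aggregate on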

Thanks for any help.

Upvotes: 1

Views: 2633

Answers (1)

Shubham Sharma

Reputation: 71689

Here is one way to approach the problem:

  • Create a helper group column to distinguish consecutive runs of loc per user
  • Then group the dataframe by the columns user, loc and group, and aggregate the column date using min and max
  • Drop the group column and sort the dataframe by startdate
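To make the snippet below self-contained, here is one way to build the sample DataFrame from the question (the SparkSession bootstrap is an assumption; keeping date as a string works here because ISO dates sort lexicographically):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('a', '2021-01-01', 1),
     ('a', '2021-01-02', 1),
     ('a', '2021-01-03', 2),
     ('a', '2021-01-04', 2),
     ('a', '2021-01-05', 1),
     ('a', '2021-01-06', 1)],
    ['user', 'date', 'loc'],
)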
from pyspark.sql import functions as F, Window

# window over each user's rows in date order
w = Window.partitionBy('user').orderBy('date')

# True on rows where loc differs from the previous row's loc
b = F.lag('loc').over(w) != F.col('loc')

(
    df.withColumn('group', b.cast('int'))             # change flag as 0/1
      .fillna(0, 'group')                             # first row per user: lag is null
      .withColumn('group', F.sum('group').over(w))    # running sum -> group id
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'), 
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
      .show()
)

+----+---+----------+----------+
|user|loc| startdate|   enddate|
+----+---+----------+----------+
|   a|  1|2021-01-01|2021-01-02|
|   a|  2|2021-01-03|2021-01-04|
|   a|  1|2021-01-05|2021-01-06|
+----+---+----------+----------+
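The same helper group can also be derived with the classic difference-of-row-numbers trick: within one consecutive run, the row's rank per user and its rank per (user, loc) grow in lockstep, so their difference is constant per run. A sketch that gives the same result on this data:

w_user = Window.partitionBy('user').orderBy('date')
w_user_loc = Window.partitionBy('user', 'loc').orderBy('date')

# constant within a consecutive run, different across runs
grp = F.row_number().over(w_user) - F.row_number().over(w_user_loc)

(
    df.withColumn('group', grp)
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'),
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
      .show()
)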

Upvotes: 2
