Reputation: 13
I have a dataset:
user | date | loc |
---|---|---|
a | 2021-01-01 | 1 |
a | 2021-01-02 | 1 |
a | 2021-01-03 | 2 |
a | 2021-01-04 | 2 |
a | 2021-01-05 | 1 |
a | 2021-01-06 | 1 |
and I am trying to end up with:
user | startdate | enddate | loc |
---|---|---|---|
a | 2021-01-01 | 2021-01-02 | 1 |
a | 2021-01-03 | 2021-01-04 | 2 |
a | 2021-01-05 | 2021-01-06 | 1 |
I have tried a window partitioned over [user, loc] ordered by date with row_number, and I have tried lag to check if the previous row's loc equals the current loc and then increment a counter, but I am stuck. Does anyone have a solution to this grouping issue? Not sure why it is so hard for me to figure out.
Thanks for any help.
Upvotes: 1
Views: 2633
Reputation: 71689
Here is one way to approach the problem:
1. Create a group column to distinguish between the consecutive rows in loc per user.
2. Group by user, loc and group, and aggregate the column date using min and max.
3. Drop the group column and sort the dataframe by startdate.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Window over each user's rows in chronological order
w = Window.partitionBy('user').orderBy('date')

# True whenever loc differs from the previous row's loc within the user
b = F.lag('loc').over(w) != F.col('loc')

(
    df.withColumn('group', b.cast('int'))
      .fillna(0, 'group')                           # first row per user has no lag value
      .withColumn('group', F.sum('group').over(w))  # running sum labels each consecutive run
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'),
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
)
+----+---+----------+----------+
|user|loc| startdate| enddate|
+----+---+----------+----------+
| a| 1|2021-01-01|2021-01-02|
| a| 2|2021-01-03|2021-01-04|
| a| 1|2021-01-05|2021-01-06|
+----+---+----------+----------+
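For reference, the same lag-plus-running-sum (gaps-and-islands) idea can be sketched in plain pandas, which makes it easy to test locally without a Spark session. The sample DataFrame below is an assumed reconstruction of the question's data; the column names (`user`, `date`, `loc`, `group`) mirror the Spark version.

```python
import pandas as pd

# Sample data mirroring the question (assumed, for illustration)
df = pd.DataFrame({
    'user': ['a'] * 6,
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-04', '2021-01-05', '2021-01-06']),
    'loc': [1, 1, 2, 2, 1, 1],
})

df = df.sort_values(['user', 'date'])

# A new "island" starts whenever loc differs from the previous row's loc
# within the same user; a cumulative sum of those breaks labels each run.
change = (df['loc'] != df.groupby('user')['loc'].shift()).astype(int)
df['group'] = change.groupby(df['user']).cumsum()

result = (df.groupby(['user', 'loc', 'group'], as_index=False)
            .agg(startdate=('date', 'min'), enddate=('date', 'max'))
            .drop(columns='group')
            .sort_values('startdate')
            .reset_index(drop=True))
```

Note that `shift()` returns NaN on the first row of each user, so the inequality is True there and the first run is labeled correctly without an explicit fillna.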
Upvotes: 2