Reputation: 3
My Code:
df=temp_df.groupBy('date','id').count()
windowval = (Window.partitionBy('date','id').orderBy('date','id').rangeBetween(Window.unboundedPreceding, 0))
final_df = df.withColumn('cum_sum', F.sum('count').over(windowval)).orderBy('date','id').show()
Please correct my code, I believe something wrong in using Window(rangeBetween).
Thanks,
DF:
+-------------------+------------------+-----+
| date| id|count|
+-------------------+------------------+-----+
|2007-11-04 00:00:00| 5| 4|
|2007-11-05 00:00:00| 5| 7|
|2007-11-06 00:00:00| 5| 3|
|2007-11-06 00:00:00| 8| 3|
|2007-11-07 00:00:00| 5| 7|
|2007-11-08 00:00:00| 5| 2|
|2007-11-08 00:00:00| 8| 4|
+-------------------+------------------+-----+
Expected output:
+-------------------+------------------+-----+-------+
| date| id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00| 5| 4| 4|
|2007-11-05 00:00:00| 5| 7| 11|
|2007-11-06 00:00:00| 5| 3| 14|
|2007-11-06 00:00:00| 8| 3| 3|
|2007-11-07 00:00:00| 5| 7| 21|
|2007-11-08 00:00:00| 5| 2| 23|
|2007-11-08 00:00:00| 8| 4| 7|
+-------------------+------------------+-----+-------+
My Output:
+-------------------+------------------+-----+-------+
| date| id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00| 5| 4| 4|
|2007-11-05 00:00:00| 5| 7| 7|
|2007-11-06 00:00:00| 5| 3| 3|
|2007-11-06 00:00:00| 8| 3| 3|
|2007-11-07 00:00:00| 5| 7| 7|
|2007-11-08 00:00:00| 5| 2| 2|
|2007-11-08 00:00:00| 8| 4| 4|
+-------------------+------------------+-----+-------+
Upvotes: 0
Views: 304
Reputation: 7399
Just change your current code to:
df = temp_df.groupBy('date', 'id').count()
windowval = Window.partitionBy('id').orderBy('date').rangeBetween(Window.unboundedPreceding, 0)
final_df = df.withColumn('cum_sum', F.sum('count').over(windowval)).orderBy('date', 'id').show()
When you partition by id and date, every (id
, date
) combination is unique. You need to partition by id
and orderBy
date
Upvotes: 1