LaSul

Reputation: 2411

PySpark - Get cumulative sum of a column with condition

I have a dataframe with card, date and amount columns, and I need to aggregate each card's amount (sum and count) over a one-month window.

Here is what the data looks like:

+--------------------+-------------------+------------+
|            card_uid|               date|amount_local|
+--------------------+-------------------+------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57|       16.19|
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29|       12.99|
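For reference, the sample above can be rebuilt roughly like this (a minimal sketch; the card_uid values are truncated here, and I'm assuming string / timestamp / double types):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above (card_uid values are truncated)
data = spark.createDataFrame(
    [
        ("card_001H4Mw1Ha0M...", "2016-05-04 17:54:30", 8.99),
        ("card_0026uGZQwZQd...", "2016-05-06 12:16:18", 16.19),
        ("card_0026uGZQwZQd...", "2016-07-06 12:17:57", 16.19),
        ("card_003STfrgB8SZ...", "2016-12-04 10:05:21", 58.8),
        ("card_005gBxyiDc6b...", "2016-09-10 18:58:25", 27.95),
        ("card_005gBxyiDc6b...", "2016-11-12 11:18:29", 12.99),
    ],
    ["card_uid", "date", "amount_local"],
).withColumn("date", to_timestamp("date"))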

This is what I have so far:

+--------------------+-------------------+------------+----------------+
|            card_uid|               date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|            8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|           16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57|       16.19|           32.38|
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|            58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|           27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29|       12.99|           40.94|

produced with the window functions below:

from pyspark.sql import Window
from pyspark.sql.functions import lag, sum

# Per-card window ordered by transaction date
partition = Window.partitionBy("card_uid").orderBy("date")

# Previous transaction time for each card
previousTransactionDate = data.withColumn("previous_tr_time", lag(data.date).over(partition)) \
                              .select("transaction_id", "card_uid", "date", "previous_tr_time")

# Running (cumulative) sum of amounts per card
df_cum_sum = data.withColumn("duration_cum_sum", sum("amount_local").over(partition))

df_cum_sum.orderBy("card_uid", "date").select("card_uid", "date", "amount_local", "duration_cum_sum").show()

But there are two things I still need to change: the sum should only cover transactions from the previous month (a rolling 30-day window), and it should not include the current transaction's amount.

So the needed output looks like this:

+--------------------+-------------------+------------+----------------+
|            card_uid|               date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|               0|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|               0|
|card_0026uGZQwZQd...|2016-05-12 12:17:57|        4.00|           16.19|
|card_0026uGZQwZQd...|2016-06-06 12:23:51|       16.19|            4.00| => Only 4 because the 16.19 was more than one month ago
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|               0|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|               0|
|card_005gBxyiDc6b...|2016-09-12 11:18:29|       12.99|           27.95| => Previous amount 
|card_005gBxyiDc6b...|2016-09-22 14:25:44|       23.99|           40.94| => 27.95 + 12.99

I can't groupBy card_uid since I need the same number of rows as the original dataframe, so that I can join the result back to another table (roughly as sketched below).
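For context, this is the kind of join I have in mind. It is only a sketch: other_table is a placeholder name, and I'm assuming transaction_id (which appears in my code above) is the join key.

# Hypothetical lookup table; "other_table" is just a placeholder name
other_table = spark.table("other_table")

# The window approach keeps one output row per input transaction,
# so the result can be joined back without losing or duplicating rows.
enriched = df_cum_sum.join(other_table, on="transaction_id", how="left")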

Upvotes: 4

Views: 8955

Answers (1)

Manoj Singh

Reputation: 1737

You need a rolling window on date, ranging from 30 days back to the previous day. Since interval expressions are not available for window ranges, you can convert the dates to long values (seconds since the epoch) and build the range from the number of seconds in a day.

from pyspark.sql import Window
from pyspark.sql.functions import col, sum

# Number of seconds in i days; rangeBetween operates on the long (seconds) value of the timestamp
days = lambda i: i * 86400

# Range from 30 days before to 1 day before the current row's timestamp,
# so the current transaction itself is excluded from the sum
partition = Window.partitionBy("card_uid") \
                  .orderBy(col("date").cast("timestamp").cast("long")) \
                  .rangeBetween(days(-30), days(-1))

df_cum_sum = data.withColumn("duration_cum_sum", sum(col("amount_local")).over(partition)) \
                 .fillna(0, subset=["duration_cum_sum"])
df_cum_sum.show()
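To compare against the desired output in the question, the result can be displayed the same way (just a usage sketch of the dataframe built above):

df_cum_sum.orderBy("card_uid", "date") \
          .select("card_uid", "date", "amount_local", "duration_cum_sum") \
          .show(truncate=False)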

Upvotes: 7
