user175025

Reputation: 434

How to calculate cumulative sum of a column based on Month column values

I have a pyspark dataframe

Location    Month       Brand   Sector  TrueValue   PickoutValue
USA         1/1/2021    brand1  cars1   7418        30000       
USA         2/1/2021    brand1  cars1   1940        2000        
USA         3/1/2021    brand1  cars1   4692        2900        
USA         4/1/2021    brand1  cars1                           
USA         1/1/2021    brand2  cars2   16383104.2  16666667    
USA         2/1/2021    brand2  cars2   26812874.2  16666667    
USA         3/1/2021    brand2  cars2                           

I have Month values from 1/1/2021 to 12/1/2021 for each Brand. I need to create another column with the cumulative sum of the TrueValue column, partitioned by Brand and Sector and ordered by Month.

My expected dataframe is:

Location    Month       Brand   Sector  TrueValue   PickoutValue    TotalSumValue   
USA         1/1/2021    brand1  cars1   7418        30000           7418
USA         2/1/2021    brand1  cars1   1940        2000            9358
USA         3/1/2021    brand1  cars1   4692        2900            14050
USA         4/1/2021    brand1  cars1                               14050
USA         1/1/2021    brand2  cars2   16383104.2  16666667        16383104.2
USA         2/1/2021    brand2  cars2   26812874.2  16666667        43195978.4
USA         3/1/2021    brand2  cars2                               43195978.4

All columns are string type, so when I order by Month, 1/1/2021 is followed by 10/1/2021, 11/1/2021, 12/1/2021, 2/1/2021 and so on (see the small sketch after my code below). Hence the cumulative sum values come out wrong. I need the cumulative sum to run month-wise from 1/1/2021 to 12/1/2021.

This is my code block:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowval = (Window.partitionBy('Brand', 'Sector').orderBy('Month')
             .rangeBetween(Window.unboundedPreceding, 0))
df1 = df1.withColumn('TotalSumValue', F.sum('TrueValue').over(windowval))
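
A minimal plain-Python sketch (no Spark needed) of the ordering issue I'm seeing:

months = ["1/1/2021", "2/1/2021", "3/1/2021", "10/1/2021", "11/1/2021", "12/1/2021"]
# Strings compare character by character, so "10/..." sorts before "2/..."
print(sorted(months))
# ['1/1/2021', '10/1/2021', '11/1/2021', '12/1/2021', '2/1/2021', '3/1/2021']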

Upvotes: 0

Views: 565

Answers (1)

Drashti Dobariya

Reputation: 3006

Convert your string date to a timestamp using the to_timestamp function from pyspark.sql.functions. Sorting on this timestamp column will then give the correct chronological order. Note that the format pattern must match your data: Month values like 1/1/2021 are 'M/d/yyyy'.

from pyspark.sql import functions as F
from pyspark.sql.functions import to_timestamp
from pyspark.sql.window import Window

# The format pattern must match the data: values like 1/1/2021 are M/d/yyyy
df1 = df1.withColumn("month_in_timestamp", to_timestamp(df1.Month, 'M/d/yyyy'))

windowval = (Window.partitionBy('Brand', 'Sector').orderBy('month_in_timestamp')
             .rangeBetween(Window.unboundedPreceding, 0))
df1 = df1.withColumn('TotalSumValue', F.sum('TrueValue').over(windowval))
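
For completeness, here is a self-contained sketch of the whole pipeline. The SparkSession setup, the sample rows, and the explicit cast of TrueValue to double are my assumptions (the question says all columns are strings); the window logic itself is the same as above.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()  # assumed local session

data = [
    ("USA", "1/1/2021", "brand1", "cars1", "7418"),
    ("USA", "2/1/2021", "brand1", "cars1", "1940"),
    ("USA", "3/1/2021", "brand1", "cars1", "4692"),
    ("USA", "4/1/2021", "brand1", "cars1", None),
]
df1 = spark.createDataFrame(data, ["Location", "Month", "Brand", "Sector", "TrueValue"])

# Parse the string month so the window orders chronologically,
# and cast TrueValue to double so sum() aggregates numerically.
df1 = (df1
       .withColumn("month_in_timestamp", F.to_timestamp("Month", "M/d/yyyy"))
       .withColumn("TrueValue", F.col("TrueValue").cast("double")))

windowval = (Window.partitionBy("Brand", "Sector").orderBy("month_in_timestamp")
             .rangeBetween(Window.unboundedPreceding, 0))
df1 = df1.withColumn("TotalSumValue", F.sum("TrueValue").over(windowval))
df1.orderBy("month_in_timestamp").show()
# TotalSumValue comes out 7418.0, 9358.0, 14050.0, 14050.0; sum() skips nulls,
# so the 4/1/2021 row carries the running total forward.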

Upvotes: 1
