Reputation: 434
I have a PySpark dataframe:
Location  Month     Brand   Sector  TrueValue   PickoutValue
USA       1/1/2021  brand1  cars1   7418        30000
USA       2/1/2021  brand1  cars1   1940        2000
USA       3/1/2021  brand1  cars1   4692        2900
USA       4/1/2021  brand1  cars1
USA       1/1/2021  brand2  cars2   16383104.2  16666667
USA       2/1/2021  brand2  cars2   26812874.2  16666667
USA       3/1/2021  brand2  cars2
I have Month values from 1/1/2021 to 12/1/2021 for each Brand. I need to create another column with the cumulative sum of the TrueValue column, partitioned by Brand and Sector and ordered by Month.
My expected dataframe is:
Location  Month     Brand   Sector  TrueValue   PickoutValue  TotalSumValue
USA       1/1/2021  brand1  cars1   7418        30000         7418
USA       2/1/2021  brand1  cars1   1940        2000          9358
USA       3/1/2021  brand1  cars1   4692        2900          14050
USA       4/1/2021  brand1  cars1                             14050
USA       1/1/2021  brand2  cars2   16383104.2  16666667      16383104.2
USA       2/1/2021  brand2  cars2   26812874.2  16666667      43195978.4
USA       3/1/2021  brand2  cars2                             43195978.4
All columns are string type, so when I order by Month, 1/1/2021 is followed by 10/1/2021, 11/1/2021, 12/1/2021, 2/1/2021 and so on, and the cumulative sum values come out wrong. I need to calculate the cumulative sum month-wise from 1/1/2021 to 12/1/2021.
This is my code block:
from pyspark.sql import Window
from pyspark.sql import functions as F

# 'Month' is a string here, so orderBy sorts it lexicographically, not chronologically
windowval = (Window.partitionBy('Brand', 'Sector').orderBy('Month')
             .rangeBetween(Window.unboundedPreceding, 0))
df1 = df1.withColumn('TotalSumValue', F.sum('TrueValue').over(windowval))
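The misordering is just lexicographic string comparison, e.g. (a minimal sketch in plain Python, outside Spark):

months = ['1/1/2021', '2/1/2021', '10/1/2021', '11/1/2021', '12/1/2021']
print(sorted(months))
# ['1/1/2021', '10/1/2021', '11/1/2021', '12/1/2021', '2/1/2021']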
Upvotes: 0
Views: 565
Reputation: 3006
Convert your string date to a timestamp using the to_timestamp function from pyspark.sql.functions. Sorting on this timestamp column will then give the correct order.
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import to_timestamp

# 'M/d/yyyy' matches values like 1/1/2021; remember to assign the result back
df1 = df1.withColumn('month_in_timestamp', to_timestamp(df1.Month, 'M/d/yyyy'))

windowval = (Window.partitionBy('Brand', 'Sector').orderBy('month_in_timestamp')
             .rangeBetween(Window.unboundedPreceding, 0))
df1 = df1.withColumn('TotalSumValue', F.sum('TrueValue').over(windowval))
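Since TrueValue is also stored as a string, it may be safer to cast it explicitly before summing rather than relying on implicit coercion (an optional tweak, reusing the windowval above):

# Optional: cast TrueValue to double before summing instead of relying on
# Spark's implicit string-to-number coercion
df1 = df1.withColumn(
    'TotalSumValue',
    F.sum(F.col('TrueValue').cast('double')).over(windowval)
)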
Upvotes: 1