Nabih Bawazir

Reputation: 7255

How to make a cumulative sum stay integer in PySpark

Here's my current output

+---+------------+
| ix|last_x_month|
+---+------------+
|  1|         1.0|
|  1|         2.0|
|  1|         3.0|
|  1|         4.0|
|  1|         5.0|
+---+------------+

Here's my code

import sys
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# running sum over all rows up to the current one
datamonthly = datamonthly.withColumn('last_x_month', F.sum(datamonthly.ix).over(Window.partitionBy().orderBy().rowsBetween(-sys.maxsize, 0)))

Here's my expected output (still an integer)

+---+------------+
| ix|last_x_month|
+---+------------+
|  1|           1|
|  1|           2|
|  1|           3|
|  1|           4|
|  1|           5|
+---+------------+

Note: I also already tried converting to integer with datamonthly.withColumn("last_x_month", datamonthly.last_x_month.cast(IntegerType())), and it still gives the same output.
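One detail worth double-checking, since the full script isn't shown above: withColumn returns a new DataFrame rather than modifying datamonthly in place, so the cast has no effect unless its result is assigned back. A minimal sketch:

from pyspark.sql.types import IntegerType

# withColumn does not mutate datamonthly; reassign to keep the cast
datamonthly = datamonthly.withColumn(
    'last_x_month', datamonthly.last_x_month.cast(IntegerType())
)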

Upvotes: 0

Views: 48

Answers (1)

Vaebhav

Reputation: 5032

The cast works fine:

Data Preparation

from pyspark.sql import SparkSession

sql = SparkSession.builder.getOrCreate()  # SparkSession handle

input_list = [(1.0,), (1.0,), (1.0,), (1.0,), (1.0,)]

sparkDF = sql.createDataFrame(input_list, ['ix'])

sparkDF.show()

+---+
| ix|
+---+
|1.0|
|1.0|
|1.0|
|1.0|
|1.0|
+---+

Window & Cast

import sys
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

window = Window.partitionBy().orderBy().rowsBetween(-sys.maxsize, 0)

sparkDF = sparkDF.withColumn('last_x_month', F.sum('ix').over(window))

# cast each listed column into a new *_int column
to_convert = ['ix', 'last_x_month']

sparkDF = reduce(lambda df, x: df.withColumn(f'{x}_int', F.col(x).cast(IntegerType())), to_convert, sparkDF)

sparkDF.show()

+---+------------+------+----------------+
| ix|last_x_month|ix_int|last_x_month_int|
+---+------------+------+----------------+
|1.0|         1.0|     1|               1|
|1.0|         2.0|     1|               2|
|1.0|         3.0|     1|               3|
|1.0|         4.0|     1|               4|
|1.0|         5.0|     1|               5|
+---+------------+------+----------------+
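If the extra *_int helper columns aren't wanted, a variant (a sketch, not part of the original answer) is to cast ix before summing; Spark's sum over an integer column returns a LongType, so the running total stays integral without a second cast:

# cast first, then sum: summing an integer column yields LongType
sparkDF = sparkDF.withColumn('ix', F.col('ix').cast(IntegerType()))
sparkDF = sparkDF.withColumn('last_x_month', F.sum('ix').over(window))
sparkDF.show()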

Upvotes: 1
