Reputation: 309
I am facing a coding challenge in Spark on Azure Databricks. I have a dataset like this:
+----+-------------------+----------+--------+-----+---------------+
|OPID| Date|BaseAmount|Interval|Cycle|RateIncrease(%)|
+----+-------------------+----------+--------+-----+---------------+
| O1|2014-07-27 00:00:00| 4375| 12| 2| 2%|
| O2|2020-12-23 00:00:00| 4975| 7| 3| 5%|
+----+-------------------+----------+--------+-----+---------------+
I need to replicate each row based on Interval, Cycle and RateIncrease(%). The Interval and Cycle fields give the number of rows to generate:
Number of rows = Interval * Cycle.
For OPID O1, the number of rows must be 24 (12 * 2).
After 12 months (12 rows), i.e. after one cycle completes, BaseAmount increases by 2% for OPID O1, and this happens after every cycle. The result should look like this:
+----+-------------------+----------+--------+-----+---------------+
|OPID| Date|BaseAmount|Interval|Cycle|RateIncrease(%)|
+----+-------------------+----------+--------+-----+---------------+
| O1|2014-07-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-08-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-09-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-10-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-11-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-12-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-01-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-02-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-03-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-04-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-05-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-06-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-07-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-08-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-09-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-10-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-11-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-12-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-01-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-02-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-03-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-04-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-05-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-06-27 00:00:00| 4463| 12| 2| 2%|
| O2|2020-12-23 00:00:00| 4975| 7| 3| 5%|
.
.
.
+----+-------------------+----------+--------+-----+---------------+
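The replication rule above can be sketched in plain Python (function and variable names are illustrative; half-up rounding is an assumption chosen to match the 4463 in the expected table):

```python
from datetime import datetime
import math

def expand(opid, start, base, interval, cycle, rate_pct):
    # One row per month; the amount steps up by rate_pct after each full cycle.
    rows = []
    for i in range(interval * cycle):
        cycle_index = i // interval  # number of completed cycles so far
        amount = math.floor(base * (1 + 0.01 * rate_pct * cycle_index) + 0.5)
        months = start.month - 1 + i  # add i months to the start date
        d = datetime(start.year + months // 12, months % 12 + 1, start.day)
        rows.append((opid, d, amount))
    return rows

rows = expand('O1', datetime(2014, 7, 27), 4375, 12, 2, 2)
# 24 rows; rows 0-11 carry 4375, rows 12-23 carry 4463
```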
I got the initial part solved thanks to user @mck (see How to insert a custom function within For loop in pyspark?).
Thank you.
Upvotes: 0
Views: 162
Reputation: 42422
You can first explode a sequence of cycle indices in order to get the adjusted BaseAmount, and then explode a sequence of timestamps, as in my previous answer:
import pyspark.sql.functions as F

result = df.withColumn(
    'cycle2',
    # one row per cycle index: 0, 1, ..., Cycle-1
    F.expr("explode(sequence(0, int(Cycle) - 1))")
).withColumn(
    'BaseAmount',
    # linear increase per completed cycle; strip the '%' sign before casting
    F.expr("int(round(BaseAmount * (1 + 0.01 * cycle2 * float(trim('%' from `RateIncrease(%)`)))))")
).withColumn(
    'Date',
    # one row per month within the cycle
    F.expr("""
        explode(
            sequence(
                timestamp(add_months(timestamp(Date), cycle2 * `Interval`)),
                timestamp(add_months(timestamp(Date), (cycle2 + 1) * `Interval` - 1)),
                interval 1 month
            )
        )
    """)
).drop('cycle2')

result.show(999)
+----+-------------------+----------+--------+-----+---------------+
|OPID| Date|BaseAmount|Interval|Cycle|RateIncrease(%)|
+----+-------------------+----------+--------+-----+---------------+
| O1|2014-07-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-08-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-09-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-10-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-11-27 00:00:00| 4375| 12| 2| 2%|
| O1|2014-12-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-01-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-02-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-03-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-04-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-05-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-06-27 00:00:00| 4375| 12| 2| 2%|
| O1|2015-07-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-08-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-09-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-10-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-11-27 00:00:00| 4463| 12| 2| 2%|
| O1|2015-12-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-01-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-02-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-03-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-04-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-05-27 00:00:00| 4463| 12| 2| 2%|
| O1|2016-06-27 00:00:00| 4463| 12| 2| 2%|
| O2|2020-12-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-01-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-02-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-03-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-04-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-05-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-06-23 00:00:00| 4975| 7| 3| 5%|
| O2|2021-07-23 00:00:00| 5224| 7| 3| 5%|
| O2|2021-08-23 00:00:00| 5224| 7| 3| 5%|
| O2|2021-09-23 00:00:00| 5224| 7| 3| 5%|
| O2|2021-10-23 00:00:00| 5224| 7| 3| 5%|
| O2|2021-11-23 00:00:00| 5224| 7| 3| 5%|
| O2|2021-12-23 00:00:00| 5224| 7| 3| 5%|
| O2|2022-01-23 00:00:00| 5224| 7| 3| 5%|
| O2|2022-02-23 00:00:00| 5473| 7| 3| 5%|
| O2|2022-03-23 00:00:00| 5473| 7| 3| 5%|
| O2|2022-04-23 00:00:00| 5473| 7| 3| 5%|
| O2|2022-05-23 00:00:00| 5473| 7| 3| 5%|
| O2|2022-06-23 00:00:00| 5473| 7| 3| 5%|
| O2|2022-07-23 00:00:00| 5473| 7| 3| 5%|
| O2|2022-08-23 00:00:00| 5473| 7| 3| 5%|
+----+-------------------+----------+--------+-----+---------------+
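Note that the increase here is linear per completed cycle, not compounded: O2's third cycle is 4975 * 1.10 = 5472.5, rounded to 5473, rather than 4975 * 1.05^2 ≈ 5485. A quick plain-Python check of that formula (helper name is mine; floor(x + 0.5) mimics Spark's half-up round):

```python
import math

def base_amount(base, rate_pct, cycle_index):
    # Linear per-cycle increase with half-up rounding (like Spark's round()).
    return math.floor(base * (1 + 0.01 * rate_pct * cycle_index) + 0.5)

print(base_amount(4375, 2, 1))   # 4463: O1 after one completed cycle
print(base_amount(4975, 5, 2))   # 5473: O2 after two completed cycles
```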
Upvotes: 1