Reputation: 333
Currently I'm performing some calculations on a database that contains information on how loans are paid by borrowers.
And my goal is to create a new dataframe that will include loans that fall under the following criteria:
Here is my dataframe:
Name ID ContractDate LoanSum ClosingDate
A ID1 2022-10-10 10 2022-10-15
A ID1 2022-10-16 8 2022-10-25
A ID1 2022-10-27 25
B ID2 2022-12-12 10 2022-10-15
B ID2 2022-12-16 22 2022-11-18
B ID2 2022-12-20 9 2022-11-25
B ID2 2023-11-29 13
C ID3 2022-11-11 30 2022-11-18
My expected result is:
Name ID ContractDate LoanSum ClosingDate
A ID1 2022-10-10 10 2022-10-15
A ID1 2022-10-16 8 2022-10-25
A ID1 2022-10-27 25
B ID2 2022-12-12 10 2022-10-15
B ID2 2022-12-16 22 2022-11-18
B ID2 2022-12-20 9 2022-11-25
B ID2 2023-11-29 13
What I have already done: (the code below helps to catch loans but it does not include outstanding loans with no ClosingDate that are bigger than previous loans)
cols = df.columns
w = Window.partitionBy('ID').orderBy('ContractDate')
newdf = df.withColumn('PreviousContractDate', f.lag('ContractDate').over(w)) \
.withColumn('PreviousLoanSum', f.lag('LoanSum').over(w)) \
.withColumn('Target', f.expr('datediff(ContractDate, PreviousContractDate) >= 1 and datediff(ContractDate, PreviousContractDate) < 16 and LoanSum - PreviousLoanSum < 0')) \
.withColumn('Target', f.col('Target') | f.lead('Target').over(w)) \
.filter('Target == True')
Any help is highly appreciated! Thanks a lot for ideas / solutions!
Upvotes: 1
Views: 50