lenpyspanacb

Reputation: 333

Filter rows based on multiple advanced criteria in PySpark

I'm currently performing some calculations on a database that contains information on how loans are repaid by borrowers.

My goal is to create a new dataframe that includes only the loans meeting certain criteria, as illustrated by the expected result below.

Here is my dataframe:

Name    ID     ContractDate LoanSum ClosingDate
A       ID1    2022-10-10   10      2022-10-15 
A       ID1    2022-10-16   8       2022-10-25
A       ID1    2022-10-27   25      
B       ID2    2022-12-12   10      2022-10-15
B       ID2    2022-12-16   22      2022-11-18
B       ID2    2022-12-20   9       2022-11-25
B       ID2    2023-11-29   13      
C       ID3    2022-11-11   30      2022-11-18

My expected result is:

Name    ID     ContractDate LoanSum ClosingDate
A       ID1    2022-10-10   10      2022-10-15 
A       ID1    2022-10-16   8       2022-10-25
A       ID1    2022-10-27   25      
B       ID2    2022-12-12   10      2022-10-15
B       ID2    2022-12-16   22      2022-11-18
B       ID2    2022-12-20   9       2022-11-25
B       ID2    2023-11-29   13      

What I have already done: (the code below catches most of the loans, but it misses outstanding loans with no ClosingDate whose LoanSum is bigger than the previous loan's)

from pyspark.sql import functions as f
from pyspark.sql.window import Window

cols = df.columns
w = Window.partitionBy('ID').orderBy('ContractDate')

newdf = df.withColumn('PreviousContractDate', f.lag('ContractDate').over(w)) \
  .withColumn('PreviousLoanSum', f.lag('LoanSum').over(w)) \
  .withColumn('Target', f.expr('datediff(ContractDate, PreviousContractDate) >= 1 and datediff(ContractDate, PreviousContractDate) < 16 and LoanSum - PreviousLoanSum < 0')) \
  .withColumn('Target', f.col('Target') | f.lead('Target').over(w)) \
  .filter(f.col('Target'))
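To make the gap concrete, here is a plain-Python sketch (not PySpark, and only an illustration) of what the window logic above computes over the sample data. Nulls produced by `lag`/`lead` at group boundaries are treated as `False`, which matches what the final filter keeps:

```python
from datetime import date

# Sample rows: (Name, ID, ContractDate, LoanSum), already sorted by ID, ContractDate.
rows = [
    ("A", "ID1", date(2022, 10, 10), 10),
    ("A", "ID1", date(2022, 10, 16), 8),
    ("A", "ID1", date(2022, 10, 27), 25),
    ("B", "ID2", date(2022, 12, 12), 10),
    ("B", "ID2", date(2022, 12, 16), 22),
    ("B", "ID2", date(2022, 12, 20), 9),
    ("B", "ID2", date(2023, 11, 29), 13),
    ("C", "ID3", date(2022, 11, 11), 30),
]

def flag_rows(rows):
    kept = []
    by_id = {}                      # emulate Window.partitionBy('ID')
    for r in rows:
        by_id.setdefault(r[1], []).append(r)
    for loans in by_id.values():
        # Target: previous loan exists, 1 <= gap < 16 days, and this loan is smaller.
        target = []
        for i, (_, _, d, s) in enumerate(loans):
            if i == 0:
                target.append(False)  # lag() is null on the first row
            else:
                _, _, prev_d, prev_s = loans[i - 1]
                gap = (d - prev_d).days
                target.append(1 <= gap < 16 and s - prev_s < 0)
        # Target | lead(Target); lead() is null (treated as False) on the last row.
        for i, (row, t) in enumerate(zip(loans, target)):
            nxt = target[i + 1] if i + 1 < len(loans) else False
            if t or nxt:
                kept.append(row)
    return kept

kept = flag_rows(rows)
# Keeps A's first two loans and B's middle two loans; the outstanding,
# larger loans (A's 25 and B's 13) are missed, as described above.
```

Running this keeps only four rows, which shows exactly which loans the current window logic fails to flag.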

Any help is highly appreciated! Thanks a lot for ideas / solutions!

Upvotes: 1

Views: 50

Answers (0)
