Reputation: 63
I have a dataframe with Customer_ID and Invoice_date and I want to convert each customer into either Active, New, Loss or Lapsed category. The data is present from July 2021 to June 2022 (12 months_) The criteria for each split is as:
So far I have tried to create a function using the below code
max_date = F.max(more_cust.INVOICE_DATE)
two_months = F.date_sub(more_cust.INVOICE_DATE, 60)
three_months = F.date_sub(more_cust.INVOICE_DATE, 90)
six_months = F.date_sub(more_cust.INVOICE_DATE, 180)
one_year = F.date_sub(more_cust.INVOICE_DATE, 360)
def recency_bucket(df1):
customer = dict()
df1 = df1.sort("INVOICE_DATE", ascending=False)
var_date = df1.rdd.map(lambda x: x.INVOICE_DATE).collect()
cust_list = df1.rdd.map(lambda x: x.CUST_ID).collect()
customer = customer.withColumn("CUST_ID", df1.collect[0]["cust_list"])
I want the output to look like this:
Upvotes: 0
Views: 62
Reputation: 26
You can categorise your invoice date in quarters say 1(jul to sep 21), 2(oct to dec 21), 3(jan to march 22), 4(april to june 22).
Invoice data
cust_id invoice_date
c1 2021-07-05
c2 2022-02-01
c2 2022-05-10
c3 2022-02-01
c4 2022-04-10
Invoice data with quarter
df = df.withColumn("quarter", F.quarter("invoice_date")).withColumn("quarter", F.when((F.col("quarter")+2) > 4,
(F.col("quarter")+2) % 4).otherwise(F.col("quarter")+2))
+-------+------------+-------+
|cust_id|invoice_date|quarter|
+-------+------------+-------+
| c1| 2021-07-05| 1|
| c2| 2022-02-01| 3|
| c2| 2022-05-10| 4|
| c3| 2022-02-01| 3|
| c4| 2022-04-10| 4|
+-------+------------+-------+
Create pivot table and define rules based on bucket criteria and categorise customers
cust_quarter = df.groupBy("cust_id").pivot("quarter", [1,2,3,4]).count().fillna(0)
cust_quarter.show()
+-------+---+---+---+---+
|cust_id| 1| 2| 3| 4|
+-------+---+---+---+---+
| c1| 1| 0| 0| 0|
| c4| 0| 0| 0| 1|
| c3| 0| 0| 1| 0|
| c2| 0| 0| 1| 1|
+-------+---+---+---+---+
new = ((F.col("4") > 0) & (F.col("1") + F.col("2") + F.col("3") == 0))
active = ((F.col("4") > 0) & (F.col("1") + F.col("2") + F.col("3") > 0))
loss = ((F.col("1") + F.col("2") > 0) & (F.col("3") + F.col("4") == 0))
lapsed = ((F.col("3") > 0) & (F.col("1") + F.col("2") + F.col("4") == 0))
bucket_rules = F.when(new, "new").when(active, "acitve").when(loss, "loss").when(lapsed, "lapsed")
cust_quarter = cust_quarter.withColumn("bucket", bucket_rules)
cust_quarter.show()
+-------+---+---+---+---+------+
|cust_id| 1| 2| 3| 4|bucket|
+-------+---+---+---+---+------+
| c1| 1| 0| 0| 0| loss|
| c4| 0| 0| 0| 1| new|
| c3| 0| 0| 1| 0|lapsed|
| c2| 0| 0| 1| 1|acitve|
+-------+---+---+---+---+------+
Upvotes: 1