R.Muthuu

Reputation: 33

PySpark Windows Function with Conditional Reset

I have a dataframe like this

|  user_id  | activity_date |
|  -------- | ------------- |
| 49630701  | 1/1/2019     |
| 49630701  | 1/10/2019    |
| 49630701  | 1/28/2019    |
| 49630701  | 2/5/2019     |
| 49630701  | 3/10/2019    |
| 49630701  | 3/21/2019    |
| 49630701  | 5/25/2019    |
| 49630701  | 5/28/2019    |
| 49630701  | 9/10/2019    |
| 49630701  | 1/1/2020     |
| 49630701  | 1/10/2020    |
| 49630701  | 1/28/2020    |
| 49630701  | 2/10/2020    |
| 49630701  | 3/10/2020    |

What I need to create is the "Group" column. The logic is: for every user, keep the same group number as long as the cumulative date difference stays within 30 days; whenever the cumulative date difference exceeds 30 days, increment the group number and reset the cumulative difference to zero (a plain-Python sketch of this rule follows the table below).

|  user_id  | activity_date | Group |
|  -------- | ------------- | ----- |
| 49630701  | 1/1/2019     |  1    |
| 49630701  | 1/10/2019    |  1    |
| 49630701  | 1/28/2019    |  1    | 
| 49630701  | 2/5/2019     |  2    | <- Cumulative date diff till here is 35, which is greater than 30, so increment the Group by 1 and reset the cumulative diff to 0 
| 49630701  | 3/10/2019    |  3    |
| 49630701  | 3/21/2019    |  3    |
| 49630701  | 5/25/2019    |  4    |
| 49630701  | 5/28/2019    |  4    |
| 49630701  | 9/10/2019    |  5    |
| 49630701  | 1/1/2020     |  6    |
| 49630701  | 1/10/2020    |  6    |
| 49630701  | 1/28/2020    |  6    |
| 49630701  | 2/10/2020    |  7    |
| 49630701  | 3/10/2020    |  7    |
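
Here is the same rule written as plain Python over one user's sorted dates, just to make the expected grouping concrete (example data only, not a Spark solution):

from datetime import date

# example data only: the first five activity dates for user 49630701
dates = [date(2019, 1, 1), date(2019, 1, 10), date(2019, 1, 28),
         date(2019, 2, 5), date(2019, 3, 10)]

group, cumul = 1, 0
groups = []
for i, d in enumerate(dates):
    cumul += (d - dates[i - 1]).days if i > 0 else 0
    if cumul > 30:       # cumulative gap exceeded 30 days
        group += 1       # start a new group
        cumul = 0        # and reset the cumulative difference
    groups.append(group)

print(groups)  # [1, 1, 1, 2, 3] -- matches the first five rows above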

I tried the code below with a loop, but it is not efficient and runs for hours. Is there a better way to achieve this? Any help would be really appreciated.

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, datediff, when, lit, rank

df = spark.read.table('excel_file')
df1 = df.select(col("user_id"), col("activity_date")).distinct()

# day difference between each activity date and the previous one, per user
partitionWindow = Window.partitionBy("user_id").orderBy(col("activity_date").asc())
lagTest = lag(col("activity_date"), 1, "0000-00-00 00:00:00").over(partitionWindow)
df1 = df1.select(col("*"), datediff(col("activity_date"), lagTest).cast("int").alias("diff_val_with_previous"))
df1 = df1.withColumn('diff_val_with_previous', when(col('diff_val_with_previous').isNull(), lit(0)).otherwise(col('diff_val_with_previous')))

distinctUser = [i['user_id'] for i in df1.select(col("user_id")).distinct().collect()]
rankTest = rank().over(partitionWindow)
df2 = df1.select(col("*"), rankTest.alias("rank"))

interimSessionThreshold = 30
totalSessionTimeThreshold = 30
rowList = []

# sequential pass: one filter/collect per user and per row, which is what makes this so slow
for x in distinctUser:
  tempDf = df2.filter(col("user_id") == x).orderBy(col('activity_date'))
  cumulDiff = 0
  group = 1
  startBatch = True
  len_df = tempDf.count()
  dp = 0
  for i in range(1, len_df + 1):
    r = tempDf.filter(col("rank") == i)
    dp = r.select("diff_val_with_previous").first()[0]
    cumulDiff += dp
    if (dp <= interimSessionThreshold) & (cumulDiff <= totalSessionTimeThreshold):
      startBatch = False
      rowList.append([r.select("user_id").first()[0], r.select("activity_date").first()[0], group])
    else:
      group += 1
      cumulDiff = 0
      startBatch = True
      dp = 0
      rowList.append([r.select("user_id").first()[0], r.select("activity_date").first()[0], group])

ddf = spark.createDataFrame(rowList, ['user_id', 'activity_date', 'group'])

Upvotes: 0

Views: 270

Answers (1)

Steven

Reputation: 15258

I can think of two solutions, but neither of them matches exactly what you want:

from pyspark.sql import functions as F, Window

df.withColumn(
    "idx", F.monotonically_increasing_id()  # monotonically increasing row id
).withColumn(
    "date_as_num", F.unix_timestamp("activity_date")  # date as seconds so rangeBetween can express "30 days"
).withColumn(
    # smallest row id seen within the trailing 30-day window
    "group", F.min("idx").over(Window.partitionBy("user_id").orderBy("date_as_num").rangeBetween(-60 * 60 * 24 * 30, 0))
).withColumn(
    # turn those ids into consecutive group numbers
    "group", F.dense_rank().over(Window.partitionBy("user_id").orderBy("group"))
).show()

+--------+-------------+----------+-----------+-----+                           
| user_id|activity_date|       idx|date_as_num|group|
+--------+-------------+----------+-----------+-----+
|49630701|   2019-01-01|         0| 1546300800|    1|
|49630701|   2019-01-10|         1| 1547078400|    1|
|49630701|   2019-01-28|         2| 1548633600|    1|
|49630701|   2019-02-05|         3| 1549324800|    2|
|49630701|   2019-03-10|         4| 1552176000|    3|
|49630701|   2019-03-21|         5| 1553126400|    3|
|49630701|   2019-05-25|         6| 1558742400|    4|
|49630701|   2019-05-28|8589934592| 1559001600|    4|
|49630701|   2019-09-10|8589934593| 1568073600|    5|
|49630701|   2020-01-01|8589934594| 1577836800|    6|
|49630701|   2020-01-10|8589934595| 1578614400|    6|
|49630701|   2020-01-28|8589934596| 1580169600|    6|
|49630701|   2020-02-10|8589934597| 1581292800|    7|
|49630701|   2020-03-10|8589934598| 1583798400|    8|
+--------+-------------+----------+-----------+-----+

or

df.withColumn(
    # day gap from the previous activity date of the same user
    "group",
    F.datediff(
        F.col("activity_date"),
        F.lag("activity_date").over(
            Window.partitionBy("user_id").orderBy("activity_date")
        ),
    ),
).withColumn(
    # running total of the gaps
    "group", F.sum("group").over(Window.partitionBy("user_id").orderBy("activity_date"))
).withColumn(
    # bucket the running total into 30-day blocks (the first row has a null gap, hence the coalesce)
    "group", F.floor(F.coalesce(F.col("group"), F.lit(0)) / 30)
).withColumn(
    # renumber the buckets consecutively
    "group", F.dense_rank().over(Window.partitionBy("user_id").orderBy("group"))
).show()

+--------+-------------+-----+                                                  
| user_id|activity_date|group|
+--------+-------------+-----+
|49630701|   2019-01-01|    1|
|49630701|   2019-01-10|    1|
|49630701|   2019-01-28|    1|
|49630701|   2019-02-05|    2|
|49630701|   2019-03-10|    3|
|49630701|   2019-03-21|    3|
|49630701|   2019-05-25|    4|
|49630701|   2019-05-28|    4|
|49630701|   2019-09-10|    5|
|49630701|   2020-01-01|    6|
|49630701|   2020-01-10|    6|
|49630701|   2020-01-28|    7|
|49630701|   2020-02-10|    7|
|49630701|   2020-03-10|    8|
+--------+-------------+-----+
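
If you really need the exact cumulative-reset behaviour from the question, one possible approach (just a sketch, assuming Spark 3.x with pyarrow installed and the column names/types from the question) is a grouped pandas UDF via applyInPandas:

import pandas as pd

def assign_groups(pdf: pd.DataFrame) -> pd.DataFrame:
    # runs once per user_id; implements "increment and reset once the cumulative gap exceeds 30 days"
    pdf = pdf.sort_values("activity_date")
    group, cumul, prev, groups = 1, 0, None, []
    for d in pdf["activity_date"]:
        cumul += (d - prev).days if prev is not None else 0
        if cumul > 30:      # cumulative gap exceeded: new group, reset counter
            group += 1
            cumul = 0
        groups.append(group)
        prev = d
    return pdf.assign(group=groups)

result = df.groupBy("user_id").applyInPandas(
    assign_groups, schema="user_id long, activity_date date, group int"
)

This keeps the sequential logic, but it runs once per user inside a single Spark task instead of doing one filter/collect per row as in the question's loop.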

Upvotes: 1
