OdiumPura

Reputation: 631

Look back X days and get column values based on a condition in Spark

I have the following DF:

--------------------------------
|Id |Date       |Value   |cond  |
|-------------------------------|
|1  |2022-08-03 |     100| 1    |                         
|1  |2022-08-04 |     200| 2    |
|1  |2022-08-05 |     150| 3    |
|1  |2022-08-06 |     300| 4    |
|1  |2022-08-07 |     400| 5    |
|1  |2022-08-08 |     150| 6    |
|1  |2022-08-09 |     500| 7    |
|1  |2022-08-10 |     150| 8    |
|1  |2022-08-11 |     150| 9    |
|1  |2022-08-12 |     700| 1    |
|1  |2022-08-13 |     800| 2    |
|1  |2022-08-14 |     150| 2    |
|1  |2022-08-15 |     300| 0    |
|1  |2022-08-16 |     200| 1    |
|1  |2022-08-17 |     150| 3    |
|1  |2022-08-18 |     150| 1    |
|1  |2022-08-19 |     250| 4    |
|1  |2022-08-20 |     150| 5    |
|1  |2022-08-21 |     400| 6    |    
|2  |2022-08-03 |     100| 1    |
|2  |2022-08-04 |     200| 2    |
|2  |2022-08-05 |     150| 1    |
|2  |2022-08-06 |     300| 1    |
|2  |2022-08-07 |     400| 1    |
|2  |2022-08-08 |     150| 1    |
|2  |2022-08-09 |     125| 1    |
|2  |2022-08-10 |     150| 1    |
|2  |2022-08-11 |     150| 3    |
|2  |2022-08-12 |     170| 6    |
|2  |2022-08-13 |     150| 7    |
|2  |2022-08-14 |     150| 8    |
|2  |2022-08-15 |     300| 1    |
|2  |2022-08-16 |     150| 9    |
|2  |2022-08-17 |     150| 0    |
|2  |2022-08-18 |     400| 1    |
|2  |2022-08-19 |     150| 1    |
|2  |2022-08-20 |     500| 1    |
|2  |2022-08-21 |     150| 1    |
--------------------------------

And this one:

---------------------
|Date       | cond  |
|-------------------|
|2022-08-03 |  1    |
|2022-08-04 |  2    |
|2022-08-05 |  1    |
|2022-08-06 |  1    |
|2022-08-07 |  1    |
|2022-08-08 |  1    |
|2022-08-09 |  1    |
|2022-08-10 |  1    |
|2022-08-11 |  3    |
|2022-08-12 |  6    |
|2022-08-13 |  8    |
|2022-08-14 |  9    |
|2022-08-15 |  1    |
|2022-08-16 |  2    |
|2022-08-17 |  2    |
|2022-08-18 |  0    |
|2022-08-19 |  1    |
|2022-08-20 |  3    |
|2022-08-21 |  1    |
--------------------

My expected output is:

-------------------------------
|Id |Date       |Avg   |Count|
|-----------------------------|
|1  |2022-08-03 |     0| 0    |                         
|1  |2022-08-04 |     0| 0    |
|1  |2022-08-05 |     0| 0    |
|1  |2022-08-06 |     0| 0    |
|1  |2022-08-07 |     0| 0    |
|1  |2022-08-08 |     0| 0    |
|1  |2022-08-09 |     0| 0    |
|1  |2022-08-10 |     0| 0    |
|1  |2022-08-11 |     0| 0    |
|1  |2022-08-12 |     0| 0    |
|1  |2022-08-13 |     0| 0    |
|1  |2022-08-14 |     0| 0    |
|1  |2022-08-15 |     0| 0    |
|1  |2022-08-16 |     0| 0    |
|1  |2022-08-17 |     0| 0    |
|1  |2022-08-18 |     0| 0    |
|1  |2022-08-19 |     0| 0    |
|1  |2022-08-20 |     0| 0    |
|1  |2022-08-21 |     0| 0    |    
|2  |2022-08-03 |     0| 0    |
|2  |2022-08-04 |     0| 0    |
|2  |2022-08-05 |     0| 1    |
|2  |2022-08-06 |     0| 2    |
|2  |2022-08-07 |     0| 3    |
|2  |2022-08-08 | 237,5| 4    |
|2  |2022-08-09 |   250| 4    |
|2  |2022-08-10 |243,75| 4    |
|2  |2022-08-11 |     0| 0    |
|2  |2022-08-12 |     0| 0    |
|2  |2022-08-13 |     0| 0    |
|2  |2022-08-14 |     0| 0    |
|2  |2022-08-15 |206,25| 4    |
|2  |2022-08-16 |     0| 0    |
|2  |2022-08-17 |     0| 0    |
|2  |2022-08-18 |     0| 0    |
|2  |2022-08-19 |243,75| 4    |
|2  |2022-08-20 |     0| 0    |
|2  |2022-08-21 | 337,5| 4    |
-------------------------------

The algorithm is:

  1. Verify if Date and Cond are the same in the first and second DFs.
  2. If the condition is true, I need to look back four days in DF1 (D-1, D-2, D-3, D-4) based on Cond and calculate the average (Avg) and count of those values. If there are more than 4 such days, I need to use the most recent 4 values to calculate the Avg, and Count is always 4 in that case.

Example situations based on the inputs (the Id = 2, Date = 2022-08-08 case is worked out after the list):

  1. Id = 1, Date = 2022-08-08
  2. Id = 2, Date = 2022-08-08
  3. Id = 2, Date = 2022-08-07
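
For the second situation (Id = 2, Date = 2022-08-08), Date and Cond = 1 match DF2, and the most recent Cond = 1 values for Id 2 come from 2022-08-03, 2022-08-05, 2022-08-06 and 2022-08-07, which gives the 237.5 / 4 row of the expected output:

values = [100, 150, 300, 400]    # Value of the 4 most recent Cond = 1 rows for Id 2 before 2022-08-08
avg = sum(values) / len(values)  # 237.5
count = len(values)              # 4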

I tried to use window functions, but with no success. I was able to achieve the output DF using SQL (joins with OUTER APPLY), but Spark doesn't have OUTER APPLY. So, my questions are:

  1. How to generate the output DF.
  2. What is the best way to generate the output DF.

MVCE to generate the input DFs in pyspark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
    ("1","2022-08-03",100,1),
    ("1","2022-08-04",200,2),
    ("1","2022-08-05",150,3),
    ("1","2022-08-06",300,4),
    ("1","2022-08-07",400,5),
    ("1","2022-08-08",150,6),
    ("1","2022-08-09",500,7),
    ("1","2022-08-10",150,8),
    ("1","2022-08-11",150,9),
    ("1","2022-08-12",700,1),
    ("1","2022-08-13",800,2),
    ("1","2022-08-14",150,2),
    ("1","2022-08-15",300,0),
    ("1","2022-08-16",200,1),
    ("1","2022-08-17",150,3),
    ("1","2022-08-18",150,1),
    ("1","2022-08-19",250,4),
    ("1","2022-08-20",150,5),
    ("1","2022-08-21",400,6),
    ("2","2022-08-03",100,1),
    ("2","2022-08-04",200,2),
    ("2","2022-08-05",150,1),
    ("2","2022-08-06",300,1),
    ("2","2022-08-07",400,1),
    ("2","2022-08-08",150,1),
    ("2","2022-08-09",125,1),
    ("2","2022-08-10",150,1),
    ("2","2022-08-11",150,3),
    ("2","2022-08-12",170,6),
    ("2","2022-08-13",150,7),
    ("2","2022-08-14",150,8),
    ("2","2022-08-15",300,1),
    ("2","2022-08-16",150,9),
    ("2","2022-08-17",150,0),
    ("2","2022-08-18",400,1),
    ("2","2022-08-19",150,1),
    ("2","2022-08-20",500,1),
    ("2","2022-08-21",150,1)
    
]

# Date is kept as a string here and cast to a date/timestamp in the answers below
schema_1 = StructType([
    StructField("Id", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Value", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])

df_1 = spark.createDataFrame(data=data_1,schema=schema_1)


data_2 = [
     ("2022-08-03", 1),
     ("2022-08-04", 2),
     ("2022-08-05", 1),
     ("2022-08-06", 1),
     ("2022-08-07", 1),
     ("2022-08-08", 1),
     ("2022-08-09", 1),
     ("2022-08-10", 1),
     ("2022-08-11", 3),
     ("2022-08-12", 6),
     ("2022-08-13", 8),
     ("2022-08-14", 9),
     ("2022-08-15", 1),
     ("2022-08-16", 2),
     ("2022-08-17", 2),
     ("2022-08-18", 0),
     ("2022-08-19", 1),
     ("2022-08-20", 3),
     ("2022-08-21", 1)
 ]

schema_2 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cond", IntegerType(), True)
])

df_2 = spark.createDataFrame(data=data_2,schema=schema_2)
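
For completeness, a minimal session setup this MVCE assumes (in a notebook or spark-shell the spark object already exists; the builder settings here are only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lookback-mvce").getOrCreate()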

UPDATE: I updated the question to be clearer about the conditions used to join the DFs!

Upvotes: 1

Views: 538

Answers (2)

viggnah

Reputation: 1879

  1. Do a left join to get the dates you are interested in.
  2. Then use a pyspark.sql.Window to collect the values you need into a list and take the size of that list as Count.
  3. Finally, with the help of pyspark.sql.functions.aggregate, compute the Avg.

from pyspark.sql import functions as F, Window

# cast to date, and rename columns for later use
df_1 = df_1.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumnRenamed("Date", "DateDf2")\
        .withColumnRenamed("Cond", "CondDf2")

# left join 
df = df_1.join(df_2, (df_1.Cond==df_2.CondDf2)&(df_1.Date==df_2.DateDf2), how='left')

windowSpec = Window.partitionBy("Id", "Cond").orderBy("Date")

# all the magic happens here!
df = (
    # only start counting when "DateDf2" is not null, and put the values into a list
    df.withColumn("value_list", F.when(F.isnull("DateDf2"), F.array()).otherwise(F.collect_list("Value").over(windowSpec.rowsBetween(-4, -1))))
    .withColumn("Count", F.size("value_list"))
    # use aggregate to sum up the list only if the size is 4! and divide by 4 to get average
    .withColumn("Avg", F.when(F.col("count")==4, F.aggregate("value_list", F.lit(0), lambda acc,x: acc+x)/4).otherwise(F.lit(0)))
    .select("Id", "Date", "Avg", "Count")
    .orderBy("Id", "Date")
)
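
Note: F.aggregate was only added in Spark 3.1. On an older cluster (2.4 / 3.0) the same sum over the collected list can be expressed with the SQL higher-order function instead; a minimal sketch, meant to replace the Avg expression in the chain above:

# Spark < 3.1: sum the collected list with the SQL aggregate() higher-order function
avg_col = F.when(F.col("Count") == 4,
                 F.expr("aggregate(value_list, 0, (acc, x) -> acc + x)") / 4
                ).otherwise(F.lit(0))
# ...then use .withColumn("Avg", avg_col) in place of the F.aggregate line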

Output is:

+---+----------+------+-----+
|Id |Date      |Avg   |Count|
+---+----------+------+-----+
|1  |2022-08-03|0.0   |0    |
|1  |2022-08-04|0.0   |0    |
|1  |2022-08-05|0.0   |0    |
|1  |2022-08-06|0.0   |0    |
|1  |2022-08-07|0.0   |0    |
|1  |2022-08-08|0.0   |0    |
|1  |2022-08-09|0.0   |0    |
|1  |2022-08-10|0.0   |0    |
|1  |2022-08-11|0.0   |0    |
|1  |2022-08-12|0.0   |0    |
|1  |2022-08-13|0.0   |0    |
|1  |2022-08-14|0.0   |0    |
|1  |2022-08-15|0.0   |0    |
|1  |2022-08-16|0.0   |0    |
|1  |2022-08-17|0.0   |0    |
|1  |2022-08-18|0.0   |0    |
|1  |2022-08-19|0.0   |0    |
|1  |2022-08-20|0.0   |0    |
|1  |2022-08-21|0.0   |0    |
|2  |2022-08-03|0.0   |0    |
|2  |2022-08-04|0.0   |0    |
|2  |2022-08-05|0.0   |1    |
|2  |2022-08-06|0.0   |2    |
|2  |2022-08-07|0.0   |3    |
|2  |2022-08-08|237.5 |4    |
|2  |2022-08-09|250.0 |4    |
|2  |2022-08-10|243.75|4    |
|2  |2022-08-11|0.0   |0    |
|2  |2022-08-12|0.0   |0    |
|2  |2022-08-13|0.0   |0    |
|2  |2022-08-14|0.0   |0    |
|2  |2022-08-15|206.25|4    |
|2  |2022-08-16|0.0   |0    |
|2  |2022-08-17|0.0   |0    |
|2  |2022-08-18|0.0   |0    |
|2  |2022-08-19|243.75|4    |
|2  |2022-08-20|0.0   |0    |
|2  |2022-08-21|337.5 |4    |
+---+----------+------+-----+

Upvotes: 1

Deku07

Reputation: 146

Here is a solution for the same requirement.


Solution:
from pyspark.sql import Window
import pyspark.sql.functions as F

df_1 = df_1.withColumn("Date", F.col("Date").cast("timestamp"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("timestamp"))

# per-Id ordering by Date, used to assign a row number (rnk)
window_spec = Window.partitionBy(["Id"]).orderBy("Date")
# the 4 rows before the current one (by rnk) within each Id
four_days_sld_wnd_excl_current_row = Window.partitionBy(["Id"]).orderBy(["rnk"]).rangeBetween(-4, -1)
# the 4 calendar days before the current one within each Id
window_spec_count_cond_ = Window.partitionBy(["Id"]).orderBy(F.unix_timestamp("Date", 'yyyy-MM-dd') / 86400).rangeBetween(-4, -1)

agg_col_cond_ = (F.col("agg") == 0.0)
date_2_col_cond_ = (F.col("Date_2").isNull())

# sum of the previous 4 values, only when the row matched df_2 and a full 4-row history exists
valid_4_days_agg_value = (F.when((~date_2_col_cond_) & (F.size(F.col("date_arrays_with_cond_1")) == 4),
                                 F.sum(F.col("Value")).over(four_days_sld_wnd_excl_current_row)).otherwise(F.lit(0.0)))

# 4 when the row matched df_2 and agg is non-zero, 0 when it neither matched nor has an agg,
# otherwise the number of matched dates seen in the previous 4 days
count_cond_ = (F.when(~agg_col_cond_ & ~date_2_col_cond_, F.lit(4))
               .when(agg_col_cond_ & date_2_col_cond_, F.lit(0))
               .otherwise(F.size(F.collect_set(F.col("Date_2")).over(window_spec_count_cond_))))

# left join on Date and Cond; Date_2 is null when the row has no match in df_2
df_jn = df_1.join(df_2, ["Date", "Cond"], "left")\
    .select(df_1["*"], df_2["Date"].alias("Date_2")).orderBy("Id", df_1["Date"])

filter_having_cond_1 = (F.col("Cond") == 1)

# rows with Cond == 1 are processed with the windows above; all other rows get agg = 0 and count = 0
df_fnl_with_cond_val_1 = df_jn.filter(filter_having_cond_1)
df_fnl_with_cond_val_other = df_jn.filter(~filter_having_cond_1)\
    .withColumn("agg", F.lit(0.0))\
    .withColumn("count", F.lit(0))\
    .drop("Date_2")

df_fnl_with_cond_val_1 = df_fnl_with_cond_val_1\
    .withColumn("rnk", F.row_number().over(window_spec))\
    .withColumn("date_arrays_with_cond_1", F.collect_set(F.col("Date")).over(four_days_sld_wnd_excl_current_row))\
    .withColumn("agg", valid_4_days_agg_value / 4)\
    .withColumn("count", count_cond_)\
    .drop("date_arrays_with_cond_1", "rnk", "Date_2")

df_fnl = df_fnl_with_cond_val_1.unionByName(df_fnl_with_cond_val_other)
df_fnl.orderBy(["Id", "Date"]).show(50, 0)

Kindly upvote if you like my solution.

Output:
+---+-------------------+-----+----+------+-----+
|Id |Date               |Value|Cond|agg   |count|
+---+-------------------+-----+----+------+-----+
|1  |2022-08-03 00:00:00|100  |1   |0.0   |0    |
|1  |2022-08-04 00:00:00|200  |2   |0.0   |0    |
|1  |2022-08-05 00:00:00|150  |3   |0.0   |0    |
|1  |2022-08-06 00:00:00|300  |4   |0.0   |0    |
|1  |2022-08-07 00:00:00|400  |5   |0.0   |0    |
|1  |2022-08-08 00:00:00|150  |6   |0.0   |0    |
|1  |2022-08-09 00:00:00|500  |7   |0.0   |0    |
|1  |2022-08-10 00:00:00|150  |8   |0.0   |0    |
|1  |2022-08-11 00:00:00|150  |9   |0.0   |0    |
|1  |2022-08-12 00:00:00|700  |1   |0.0   |0    |
|1  |2022-08-13 00:00:00|800  |2   |0.0   |0    |
|1  |2022-08-14 00:00:00|150  |2   |0.0   |0    |
|1  |2022-08-15 00:00:00|300  |0   |0.0   |0    |
|1  |2022-08-16 00:00:00|200  |1   |0.0   |0    |
|1  |2022-08-17 00:00:00|150  |3   |0.0   |0    |
|1  |2022-08-18 00:00:00|150  |1   |0.0   |0    |
|1  |2022-08-19 00:00:00|250  |4   |0.0   |0    |
|1  |2022-08-20 00:00:00|150  |5   |0.0   |0    |
|1  |2022-08-21 00:00:00|400  |6   |0.0   |0    |
|2  |2022-08-03 00:00:00|100  |1   |0.0   |0    |
|2  |2022-08-04 00:00:00|200  |2   |0.0   |0    |
|2  |2022-08-05 00:00:00|150  |1   |0.0   |1    |
|2  |2022-08-06 00:00:00|300  |1   |0.0   |2    |
|2  |2022-08-07 00:00:00|400  |1   |0.0   |3    |
|2  |2022-08-08 00:00:00|150  |1   |237.5 |4    |
|2  |2022-08-09 00:00:00|125  |1   |250.0 |4    |
|2  |2022-08-10 00:00:00|150  |1   |243.75|4    |
|2  |2022-08-11 00:00:00|150  |3   |0.0   |0    |
|2  |2022-08-12 00:00:00|170  |6   |0.0   |0    |
|2  |2022-08-13 00:00:00|150  |7   |0.0   |0    |
|2  |2022-08-14 00:00:00|150  |8   |0.0   |0    |
|2  |2022-08-15 00:00:00|300  |1   |206.25|4    |
|2  |2022-08-16 00:00:00|150  |9   |0.0   |0    |
|2  |2022-08-17 00:00:00|150  |0   |0.0   |0    |
|2  |2022-08-18 00:00:00|400  |1   |0.0   |0    |
|2  |2022-08-19 00:00:00|150  |1   |243.75|4    |
|2  |2022-08-20 00:00:00|500  |1   |0.0   |0    |
|2  |2022-08-21 00:00:00|150  |1   |337.5 |4    |
+---+-------------------+-----+----+------+-----+
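
Note that this output still carries the Value and Cond columns; to match the expected column names exactly, a final select along these lines should work:

df_fnl.select("Id", "Date", F.col("agg").alias("Avg"), F.col("count").alias("Count"))\
    .orderBy("Id", "Date").show(50, 0)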

Upvotes: 1
