Reputation: 145
I'm looking for a way to translate this chunk of SQL code into PySpark syntax.
SELECT MEAN(some_value) OVER (
    ORDER BY yyyy_mm_dd
    RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS mean
FROM df
If the above were a range expressed in days, this could easily be done using something like
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
(See also ZygD's solution here: Spark Window Functions - rangeBetween dates)
For a range in months, however, this doesn't work, because the number of days in a month is not constant. Any idea how to perform a range over months using PySpark syntax?
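For context, here is roughly what the day-based version looks like end to end; a minimal sketch, assuming a DataFrame df with the yyyy_mm_dd and some_value columns from the query above, and using 1970-01-01 as an arbitrary reference date:
from pyspark.sql import functions as F, Window

# convert the date to a day number so that rangeBetween() operates in days
w_7d = (
    Window.orderBy(F.expr("datediff(yyyy_mm_dd, '1970-01-01')"))
    .rangeBetween(-7, 0)
)
df_7d = df.withColumn("mean_7d", F.mean("some_value").over(w_7d))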
Upvotes: 1
Views: 1213
Reputation: 21
It's not ideal, but estimating roughly mid-month boundaries should be sufficient to get the relevant months into the range.
from pyspark.sql import functions as F, Window

# define approximate monthly windows of interest (offsets in days preceding the current row)
window_dict = {
    "w1": {"start": -40, "end": -1},
    "w2": {"start": -70, "end": -1},
    "w3": {"start": -100, "end": -1},
    "w4": {"start": -130, "end": -1},
}

# set window - this one is for w3 (roughly the last 3 months, excluding the current day)
window_3M = (
    Window.partitionBy("item_id")
    .orderBy(F.expr("unix_date(to_date(ds, 'yyyy-MM-dd'))"))  # days since epoch, so rangeBetween is in days
    .rangeBetween(window_dict["w3"]["start"], window_dict["w3"]["end"])
)

# apply as needed
new_df = old_df.withColumn("SumOf3M", F.sum("some_column").over(window_3M))
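If you need all four horizons, the same dictionary can drive a small loop. A minimal sketch, assuming the same old_df, item_id and ds columns as above; the output column names are just illustrative:
# build one rolling sum per approximate window (w1..w4)
for name, bounds in window_dict.items():
    w = (
        Window.partitionBy("item_id")
        .orderBy(F.expr("unix_date(to_date(ds, 'yyyy-MM-dd'))"))
        .rangeBetween(bounds["start"], bounds["end"])
    )
    old_df = old_df.withColumn(f"SumOf_{name}", F.sum("some_column").over(w))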
Upvotes: 0
Reputation: 24478
You can "borrow" the full SQL column expression and use it in PySpark.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('2022-05-01', 1),
     ('2022-06-01', 2),
     ('2022-07-01', 3),
     ('2022-08-01', 4),
     ('2022-09-01', 5)],
    ['yyyy_mm_dd', 'some_value']
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))
Script:
df = df.withColumn('mean', F.expr("mean(some_value) over (order by yyyy_mm_dd range between interval 3 months preceding and current row)"))
df.show()
# +----------+----------+----+
# |yyyy_mm_dd|some_value|mean|
# +----------+----------+----+
# |2022-05-01| 1| 1.0|
# |2022-06-01| 2| 1.5|
# |2022-07-01| 3| 2.0|
# |2022-08-01| 4| 2.5|
# |2022-09-01| 5| 3.5|
# +----------+----------+----+
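If you prefer to keep the whole statement in SQL, the same window expression also works through a temporary view; a minimal sketch, assuming the same df as above (the view name "df" is just a choice):
# register the DataFrame and run the original query verbatim
df.createOrReplaceTempView("df")
spark.sql("""
    SELECT yyyy_mm_dd,
           some_value,
           MEAN(some_value) OVER (
               ORDER BY yyyy_mm_dd
               RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
           ) AS mean
    FROM df
""").show()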
Upvotes: 0