Aleksei Diaz

Reputation: 13

Median calculation over windows - rangeBetween over months in pyspark databricks

I am working in Databricks PySpark and I am trying to get the median over the last two months, excluding the current month. There are multiple accounts, so the window must also be partitioned by account.

from pyspark.sql.functions import months, median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy('date').rangeBetween(-months(2), months(0))
df = df.withColumn('median', round(median('cost').over(windowSpec), 3))

but I get this deprecation warning:

/databricks/spark/python/pyspark/sql/functions/builtin.py:19025: FutureWarning: Deprecated in 4.0.0, use partitioning.months instead.
  warnings.warn("Deprecated in 4.0.0, use partitioning.months instead.", FutureWarning)

Following the warning, I switched to the new import:

from pyspark.sql.functions.partitioning import months
from pyspark.sql.functions import median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy('date').rangeBetween(-months(2), months(0))
df = df.withColumn('median', round(median('cost').over(windowSpec), 3))

but I get this error message:

Argument col should be a Column or str, got int.
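Looking at the docs, partitioning.months seems to be a partition transform for table writes: it takes a column, not an integer, which is why rangeBetween rejects it. A minimal sketch of its intended use (the table name here is hypothetical):

from pyspark.sql.functions.partitioning import months

# Partition transform for DataFrameWriterV2; takes a column, not an offset.
# "my_catalog.my_schema.costs" is a made-up table name for illustration.
df.writeTo("my_catalog.my_schema.costs").partitionedBy(months("date")).createOrReplace()

So neither months function can express a "two months back" frame bound; rangeBetween expects plain numeric offsets over a numeric orderBy column.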

input:

account date cost
account1 2024-10-01 5.00
account1 2024-10-02 6.00
account1 2024-10-03 7.00
account1 2024-11-01 8.00
account1 2024-11-02 9.00
account1 2024-11-03 10.00
account1 2024-12-01 4.88
account1 2024-12-02 8.46
account1 2024-12-03 9.43

expected output (for the December rows the window covers October and November, costs 5, 6, 7, 8, 9, 10, whose median is 7.5):

account date cost median
account1 2024-10-01 5.00 null
account1 2024-10-02 6.00 null
account1 2024-10-03 7.00 null
account1 2024-11-01 8.00 null
account1 2024-11-02 9.00 null
account1 2024-11-03 10.00 null
account1 2024-12-01 4.88 7.5
account1 2024-12-02 8.46 7.5
account1 2024-12-03 9.43 7.5

Upvotes: 1

Views: 82

Answers (1)

Ahmed

Reputation: 385

To obtain the median of month-2 (the month two months back), like this:

+--------+----------+----+------+
| account|      date|cost|median|
+--------+----------+----+------+
|account1|2024-10-01| 5.0|  null|
|account1|2024-10-02| 6.0|  null|
|account1|2024-10-03| 7.0|  null|
|account1|2024-11-01| 8.0|  null|
|account1|2024-11-02| 9.0|  null|
|account1|2024-11-03|10.0|  null|
|account1|2024-12-01|4.88|   6.0|
|account1|2024-12-02|8.46|   6.0|
|account1|2024-12-03|9.43|   6.0|
+--------+----------+----+------+

you can use this code:

from pyspark.sql.functions import expr, percentile_approx, round
from pyspark.sql.window import Window


# Calculate a month_offset (a monotonically increasing month index)
df = df.withColumn("year", expr("YEAR(date)")).withColumn("month", expr("MONTH(date)"))
df = df.withColumn("month_offset", expr("year * 12 + month"))

# Define the window spec with a range between -2 and -2 to get month - 2
window_spec = Window.partitionBy("account").orderBy("month_offset").rangeBetween(-2, -2)


# Apply the median over the window spec
df = df.withColumn(
    "median",
    round(percentile_approx("cost", 0.5).over(window_spec), 3)
)

df.select("account", "date", "cost", "median").show()
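Note that this frame looks only at month-2, so December gets the October-only median (6.0), not the 7.5 from the expected output. If you want the previous two months pooled together, a minimal variant of the same idea (assuming Spark 3.4+, where the exact median function is available as a window aggregate) would be:

from pyspark.sql.functions import expr, median, round
from pyspark.sql.window import Window

# Same monotonically increasing month index as above
df = df.withColumn("month_offset", expr("YEAR(date) * 12 + MONTH(date)"))

# Previous two months, current month excluded
window_spec = Window.partitionBy("account").orderBy("month_offset").rangeBetween(-2, -1)

# Exact median: for December this pools October and November
# (5, 6, 7, 8, 9, 10), giving 7.5
df = df.withColumn("median", round(median("cost").over(window_spec), 3))

One caveat: with this frame, November rows get the October-only median (6.0) rather than null; if you need null until two full prior months exist, you would have to mask those rows separately.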

I hope this will help you :)

Upvotes: 0
