Reputation: 601
I have a dataframe of timeseries pricing data, with an ID, Date and Price.
I need to compute the Exponential Moving Average for the Price Column, and add it as a new column to the dataframe.
I have been using Spark's window functions before, and it looked like a fit for this use case, but given the formula for the EMA:
EMA: {Price - EMA(previous day)} x multiplier + EMA(previous day)
where
multiplier = (2 / (Time periods + 1)) //let's assume Time period is 10 days for now
I got a bit confused as to how can I access to the previous computed value in the column, while actually window-ing over the column. With a simple moving average, it's simple, since all you need to do is compute a new column while averaging the elements in the window:
var window = Window.partitionBy("ID").orderBy("Date").rowsBetween(-windowSize, Window.currentRow)
dataFrame.withColumn(avg(col("Price")).over(window).alias("SMA"))
But it seems that with EMA its a bit more complicated since at every step I need the previous computed value.
I have also looked at Weighted moving average in Pyspark but I need an approach for Spark/Scala, and for a 10 or 30 days EMA.
Any ideas?
Upvotes: 4
Views: 5375
Reputation: 601
In the end, I've analysed how exponential moving average is implemented in pandas dataframes. Besides the recursive formula which I described above, and which is difficult to implement in any sql or window function(because its recursive), there is another one, which is detailed on their issue tracker:
y[t] = (x[t] + (1-a)*x[t-1] + (1-a)^2*x[t-2] + ... + (1-a)^n*x[t-n]) /
((1-a)^0 + (1-a)^1 + (1-a)^2 + ... + (1-a)^n).
Given this, and with additional spark implementation help from here, I ended up with the following implementation, which is roughly equivalent with doing pandas_dataframe.ewm(span=window_size).mean().
def exponentialMovingAverage(partitionColumn: String, orderColumn: String, column: String, windowSize: Int): DataFrame = {
val window = Window.partitionBy(partitionColumn)
val exponentialMovingAveragePrefix = "_EMA_"
val emaUDF = udf((rowNumber: Int, columnPartitionValues: Seq[Double]) => {
val alpha = 2.0 / (windowSize + 1)
val adjustedWeights = (0 until rowNumber + 1).foldLeft(new Array[Double](rowNumber + 1)) { (accumulator, index) =>
accumulator(index) = pow(1 - alpha, rowNumber - index); accumulator
}
(adjustedWeights, columnPartitionValues.slice(0, rowNumber + 1)).zipped.map(_ * _).sum / adjustedWeights.sum
})
dataFrame.withColumn("row_nr", row_number().over(window.orderBy(orderColumn)) - lit(1))
.withColumn(s"$column$exponentialMovingAveragePrefix$windowSize", emaUDF(col("row_nr"), collect_list(column).over(window)))
.drop("row_nr")
}
(I am presuming the type of the column for which I need to compute the exponential moving average is Double.)
I hope this helps others.
Upvotes: 8