Reputation: 3572
I am processing a time-series dataset and need to calculate stddev, mean, etc. over a sliding window of (-100, +100) rows.
I observed that the windowing is applied separately for each of these calculations, even though the sliding window is the same for all of them.
Is there a way to combine these calculations so that there is only a single window and all the required fields are derived from it?
val w = Window.partitionBy("raw_data_field_id").orderBy("date_time_epoch").rowsBetween(-100, 100)
val rawdatax = rawdata
  .withColumn("valueSqrtStdDev", stddev_pop(col("valueSqrt")).over(w))
  .withColumn("valueSqrtMean", mean(col("valueSqrt")).over(w))
  ....
Upvotes: 0
Views: 1042
Reputation: 1274
If you really want to perform multiple operations over one window, you could use a UDF or UDAF.
An example using a UDF:
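import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, udf}
val multipleAgg = udf { (ls: Seq[Double]) =>
  // placeholder: perform multiple aggregations on ls here; the last expression is what the UDF returns
}
val w = Window.partitionBy("raw_data_field_id").orderBy("date_time_epoch").rowsBetween(-100, 100)
val rawdatax = rawdata.withColumn("aggregated", multipleAgg(collect_list(col("valueSqrt")).over(w)))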
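As a minimal sketch of what the body of such a UDF might compute, assuming you only need the population mean and population standard deviation of each window: the names `WindowStats` and `windowStats` are hypothetical and not part of Spark or of the code above. The helper is plain Scala, so it can be checked without a cluster.

```scala
// Hypothetical result type for illustration: both statistics for one window.
final case class WindowStats(mean: Double, stdDevPop: Double)

// Hypothetical helper: takes the values collected for one window and returns
// the population mean and population standard deviation together.
def windowStats(values: Seq[Double]): WindowStats =
  if (values.isEmpty) {
    // No rows in the window: nothing sensible to report.
    WindowStats(Double.NaN, Double.NaN)
  } else {
    val n = values.size.toDouble
    val mean = values.sum / n
    // Population variance: average squared deviation from the mean.
    val variance = values.map(v => (v - mean) * (v - mean)).sum / n
    WindowStats(mean, math.sqrt(variance))
  }
```

In principle this could be wrapped with `udf(windowStats _)` and applied to the `collect_list(...).over(w)` column, which should yield a struct column holding both values, but I have not benchmarked that against the built-in functions.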
On the other hand, for performance reasons, I would keep using the built-in DataFrame API if possible. You might be interested in reading this article on the advantages of the DataFrame/Dataset API over UDFs/UDAFs.
In your case the data normally gets repartitioned only once, after the first window function, so concerns about data movement and performance do not really apply here.
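One way to check the repartitioning claim yourself is to print the physical plan. The following is a rough sketch, assuming a local `SparkSession` and a few made-up rows standing in for `rawdata`; only the column names and the window definition come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, mean, stddev_pop}

// Assumption: a throwaway local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("window-plan-check").getOrCreate()
import spark.implicits._

// Assumption: toy rows with the same column names as in the question.
val rawdata = Seq(
  (1, 1L, 1.0), (1, 2L, 2.0), (1, 3L, 3.0), (2, 1L, 10.0)
).toDF("raw_data_field_id", "date_time_epoch", "valueSqrt")

val w = Window.partitionBy("raw_data_field_id").orderBy("date_time_epoch").rowsBetween(-100, 100)

val rawdatax = rawdata
  .withColumn("valueSqrtStdDev", stddev_pop(col("valueSqrt")).over(w))
  .withColumn("valueSqrtMean", mean(col("valueSqrt")).over(w))

// Print the physical plan and count the Exchange (shuffle) nodes in it.
rawdatax.explain()
```

If the plan shows a single Exchange on `raw_data_field_id` feeding the window computation, both aggregates share one shuffle and one sort, which is what the statement above describes.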
Upvotes: 2