Remis Haroon - رامز

Reputation: 3572

Spark: How to combine multiple window aggregations performed over the same sliding window

I am processing a time-series dataset and I need to calculate stddev, mean, etc. over a sliding window of (-100, +100) rows.
I observed that the windowing is applied separately for each of these calculations, even though the sliding window is the same for all of them.
Is there a way to combine all these calculations, so that only a single window is built and all the required fields are derived from that one window?

  val w = Window.partitionBy("raw_data_field_id").orderBy("date_time_epoch").rowsBetween(-100,100)
  val rawdatax = rawdata
    .withColumn("valueSqrtStdDev", stddev_pop(col("valueSqrt")).over(w))
    .withColumn("valueSqrtMean", mean(col("valueSqrt")).over(w))
    ....


Upvotes: 0

Views: 1042

Answers (1)

RudyVerboven

Reputation: 1274

If you really want to combine multiple aggregations over a single window, you could use a UDF/UDAF.

An example using a UDF, sketched here to compute the mean and the population standard deviation in one pass:

val multipleAgg = udf { (ls: Seq[Double]) =>
  // perform multiple aggregations in one pass, e.g. mean and population stddev
  val mean = ls.sum / ls.size
  val stddev = math.sqrt(ls.map(x => math.pow(x - mean, 2)).sum / ls.size)
  (mean, stddev) // a Scala tuple is returned as a struct column
}

val w = Window.partitionBy("raw_data_field_id").orderBy("date_time_epoch").rowsBetween(-100,100)
val rawdatax = rawdata.withColumn("aggregated", multipleAgg(collect_list(col("valueSqrt")).over(w)))
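
Since the UDF returns a tuple, Spark exposes the result as a struct column; its fields can then be unpacked into named columns (the names below are just illustrative):

val result = rawdatax.select(
  col("*"),
  col("aggregated._1").as("valueSqrtMean"),
  col("aggregated._2").as("valueSqrtStdDev")
)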

On the other hand, for performance reasons I would keep using the built-in DataFrame API where possible. You might be interested in reading this article on the advantages of the DataFrame/Dataset API over UDFs/UDAFs.

Normally, in your case, the data gets re-partitioned only once, for the first window function; the subsequent window functions reuse the same partitioning and ordering. So any concern about repeated data movement and performance is not relevant here.
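
For completeness, a minimal sketch of the built-in approach (column names taken from the question). All aggregations share the same window specification, and Spark plans window expressions with identical specifications into a single Window operator, which you can verify with explain():

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, mean, stddev_pop}

val w = Window.partitionBy("raw_data_field_id").orderBy("date_time_epoch").rowsBetween(-100, 100)
val rawdatax = rawdata.select(
  col("*"),
  stddev_pop(col("valueSqrt")).over(w).as("valueSqrtStdDev"),
  mean(col("valueSqrt")).over(w).as("valueSqrtMean")
)

rawdatax.explain() // expect one Exchange + Sort feeding a single Window node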

Upvotes: 2
