Raphael Roth

Reputation: 27373

Parallelism of window-functions in Spark 2

I have a rather simple (but essential) question about Spark window functions (such as lead, lag, count, sum, row_number, etc.):

If I specify my window as Window.partitionBy(lit(0)) (i.e. I need to run the window function across the entire dataframe), does the window aggregate function run in parallel, or are all records moved into one single task?

EDIT:

Especially for "rolling" operations (e.g. a rolling average using something like avg(...).over(Window.partitionBy(lit(0)).orderBy(...).rowsBetween(-10,10))), this operation could very well be split up into different tasks even though all data is in the same partition of the window, because only about 20 rows are needed at once to compute the average.

Upvotes: 3

Views: 2080

Answers (1)

Ramesh Maharjan

Reputation: 41957

If you define the window as Window.partitionBy(lit(0)), or if you don't call partitionBy at all, then all partitions of the dataframe are collected into a single partition on one executor, and that executor performs the aggregate function over the whole dataframe. So parallelism is not preserved.

This collection is different from the collect() function: collect() gathers all partitions into the driver node, whereas a window without a meaningful partitionBy shuffles all the data into a single partition on an executor.

Upvotes: 3
