Reputation: 27373
I have a rather simple (but essential) question about Spark window functions (such as lead, lag, count, sum, row_number, etc.):
If I specify my window as Window.partitionBy(lit(0))
(i.e. I need to run the window function across the entire dataframe), does the window aggregate function run in parallel, or are all records moved into one single task?
EDIT:
Especially for "rolling" operations (e.g. a rolling average using something like avg(...).over(Window.partitionBy(lit(0)).orderBy(...).rowsBetween(-10, 10))
), this operation could very well be split up into different tasks, even though all data is in the same partition of the window, because only 21 rows are needed at once to compute each average.
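To make the semantics of rowsBetween(-10, 10) concrete, here is a minimal pure-Python sketch of what that frame computes per row (the function name rolling_avg is illustrative, not part of any Spark API): for row i, it averages the values from i-10 through i+10, clipped at the ends of the data.

```python
def rolling_avg(values, before=10, after=10):
    """For each index i, average values[i-before : i+after+1], clipped to bounds."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        out.append(sum(window) / len(window))
    return out
```

Note that each output row only ever touches a bounded slice of its neighbors, which is why the question asks whether Spark could, in principle, split this work across tasks.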
Upvotes: 3
Views: 2080
Reputation: 41957
If you define your window as Window.partitionBy(lit(0))
, or if you don't define a partitionBy
at all, then all of the partitions of the dataframe will be collected as one on a single executor, and that executor performs the aggregate function on the whole dataframe. So parallelism will not be preserved.
This collection is different from the collect()
function: collect()
brings all of the partitions to the driver
node, whereas partitionBy
shuffles the data to an executor, so that all rows sharing the same partition key end up together on that executor.
Upvotes: 3