Neethu Lalitha

Reputation: 3071

Spark writing data back to HDFS

I have a question about how Spark writes results after a computation. I know that each executor writes its result back to HDFS or the local filesystem (depending on the cluster manager used) once it finishes processing its partitions.

This makes sense: there is no reason to wait for all executors to complete before writing if no aggregation of results is needed.

But how does the write operation work when the data needs to be sorted on a particular column (e.g. ID), in ascending or descending order?

Does Spark's logical plan sort the partitions by ID at each executor before the computation even begins? In that case, any executor could finish first and start writing its result to HDFS, so how does the framework ensure that the final result is sorted?

Thanks in advance

Upvotes: 1

Views: 651

Answers (1)

Juh_

Reputation: 15569

From what I understood from this answer: https://stackoverflow.com/a/32888236/1206998, sorting is a process that shuffles all dataset items into "sorted" partitions using a RangePartitioner: the "boundaries" between partitions are items selected as percentiles of a sample of the dataset.

So it works something like this:

  • collect a sample of the dataset
  • sort the sampled items
  • select every i-th item as a boundary, where i is the sample size divided by the number of output partitions
  • broadcast those boundaries
  • in every input partition, for every item, determine which output partition the item should go to by comparing it with the broadcast boundaries
  • shuffle the data into those output partitions
  • sort the items within each partition
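The steps above can be sketched in plain Python (this is an illustrative toy, not Spark's actual implementation; the function name and parameters are made up):

```python
import random

def range_partition(data, num_partitions, sample_size=20, seed=0):
    """Toy sketch of Spark's RangePartitioner idea: pick boundary values
    from a sorted sample, then bucket every item by comparing it against
    those boundaries, and finally sort within each bucket."""
    rng = random.Random(seed)
    # 1. collect a sample and sort it
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # 2. pick num_partitions - 1 boundaries at evenly spaced ranks
    step = len(sample) / num_partitions
    boundaries = [sample[int(step * i)] for i in range(1, num_partitions)]
    # 3. route each item to the partition whose range contains it
    #    (real Spark does a binary search over the boundaries)
    partitions = [[] for _ in range(num_partitions)]
    for x in data:
        p = sum(x > b for b in boundaries)
        partitions[p].append(x)
    # 4. sort within each partition; concatenating the partitions in
    #    order then yields a globally sorted result
    return [sorted(p) for p in partitions]
```

Because every item in partition k is bounded above by every item in partition k+1, no cross-partition merge is needed afterwards.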

If we have the dataset [1, 5, 6, 8, 10, 20, 100] (distributed and in any order) and sort it into 3 partitions, that would give:

  • partition 1 = [1, 5, 6] (sorted within partition)
  • partition 2 = [8, 10] (likewise)
  • partition 3 = [20, 100] (likewise)

Thus any later operation can be done on each partition independently, including writing.
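To see why the independent writes still produce a sorted result, here is a hypothetical simulation: each "executor" writes its sorted partition to its own part file as soon as it finishes, in an arbitrary order, and a reader that lists the files in partition-number order (like Spark's part-00000, part-00001, ... naming) gets a globally sorted result:

```python
import os
import tempfile

# The three range-partitioned, locally sorted partitions from the example.
partitions = [[1, 5, 6], [8, 10], [20, 100]]
outdir = tempfile.mkdtemp()

# Write in a scrambled order to mimic executors finishing at different times.
for i in (2, 0, 1):
    with open(os.path.join(outdir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(map(str, partitions[i])))

# A reader lists the part files lexicographically, i.e. in partition order.
result = []
for name in sorted(os.listdir(outdir)):
    with open(os.path.join(outdir, name)) as f:
        result.extend(int(line) for line in f)
```

The order in which the files were written does not matter; only the partition numbering does.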

Keep in mind that:

  • Spark manages data in memory and, depending on configuration, may spill partition data to local disk.
  • The write is done per partition, but the output files (on distributed filesystems like HDFS) stay hidden until all data are written. At least that is the case for the Parquet writer; I'm not sure about other writers.
  • As you would expect, sorting is an expensive operation.

Upvotes: 1
