Reputation: 2328
What happens under the hood when you sort a DataFrame in Spark?
For example,
df = spark.read.format('csv').option('header', 'true').load('foo.csv')
df.sort('i')
I know what happens when you read data into the DataFrame, but I am curious what happens when you sort. What's the difference compared with MapReduce?
Upvotes: 0
Views: 523
Reputation: 546
Spark and MapReduce are both data-processing frameworks. The key difference between them lies in the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to disk. As a result, the speed of processing differs significantly: Spark may be up to 100 times faster. However, the volume of data processed also differs: Hadoop MapReduce is able to work with far larger data sets than Spark.
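To see concretely what happens when you sort a DataFrame, you can ask Spark to print the physical plan. Below is a minimal PySpark sketch (the local session, toy DataFrame, and column name are illustrative assumptions): a global sort shows up as an exchange with range partitioning (the shuffle) followed by a per-partition sort.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sort-demo").getOrCreate()

# Toy DataFrame standing in for the CSV read in the question
df = spark.createDataFrame([(3, "c"), (1, "a"), (2, "b")], ["i", "s"])

# The physical plan shows an "Exchange rangepartitioning" (the shuffle that
# redistributes rows into ordered key ranges) followed by a Sort operator
# that sorts each partition locally, yielding a total ordering.
df.sort("i").explain()

Both frameworks break such a distributed sort into a sampling stage, a map stage, and a reduce stage, compared below.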
Sampling stage: In MapReduce the sampling is performed by a lightweight central program and is very quick, whereas in Spark the sampling stage scans the whole input file, so its disk utilization is quite high while its CPU utilization is low.
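For intuition, here is a simplified, hypothetical sketch of what such a sampling step computes: sample the keys, sort the sample, and pick the boundaries that assign each reduce partition an ordered key range (the function name and sample size are illustrative, not Spark's or MapReduce's actual code):

import random

def range_boundaries(keys, num_partitions, sample_size=1000):
    # Sample and sort a subset of the keys
    sample = sorted(random.sample(list(keys), min(sample_size, len(keys))))
    # Pick num_partitions - 1 evenly spaced boundary keys from the sample
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

print(range_boundaries(range(10_000), num_partitions=4))

Keys below the first boundary go to partition 0, and so on, so the partitions hold disjoint ordered key ranges that can be sorted independently.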
Map stage: Both Spark and MapReduce are CPU-bound in the map stage. Even though Spark and MapReduce use different shuffle frameworks, their map stages are bounded by map-output compression. Furthermore, for Spark, disk I/O is significantly reduced in the map stage compared to the sampling stage, even though the map stage also scans the whole input file. The reduced disk I/O is a result of reading input file blocks cached in the OS buffer during the sampling stage.
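The compression that bounds the map stage is configurable. A minimal sketch of the relevant settings (these are standard Spark configuration keys, shown here with typical values; tune them only after measuring):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.shuffle.compress", "true")     # compress map output files (default)
    .config("spark.io.compression.codec", "lz4")  # codec used for shuffle data
    .getOrCreate()
)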
Reduce stage: The reduce stage in both Spark and MapReduce uses an external sort to get a total ordering on the shuffled map output. MapReduce is 2.8x faster than Spark for this stage, mainly because in MapReduce the shuffle is overlapped with the map stage, which hides the network overhead.
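The external sort in the reduce stage can be pictured as a k-way merge of pre-sorted runs. A minimal, self-contained Python sketch (the in-memory lists stand in for sorted spill files on disk):

import heapq

# Each list stands in for a sorted run produced from one map task's output
sorted_runs = [
    [1, 4, 7],
    [2, 5, 8],
    [0, 3, 6, 9],
]

# heapq.merge lazily merges the runs into one total ordering, which is
# what the reduce stage does over the shuffled map output
print(list(heapq.merge(*sorted_runs)))  # [0, 1, 2, ..., 9]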
Comparison of shuffle components (Spark vs. MapReduce):
Before Apache Spark 1.1, Spark used hash-based shuffle (Spark-Hash), which requires maintaining one concurrent in-memory buffer per reduce partition. From Apache Spark 1.1 onwards, Spark uses sort-based shuffle (Spark-Sort), which is also the approach MapReduce takes. In sort-based shuffle, only a single buffer is required at any given point. This substantially reduces memory overhead during the shuffle and can support workloads with hundreds of thousands of tasks in a single stage (the petabyte-sort benchmark used 250,000 tasks).
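In the Spark 1.x era you could pick the shuffle implementation explicitly via configuration (the spark.shuffle.manager key existed in Spark 1.x; hash-based shuffle was removed in Spark 2.0, so this sketch is purely for illustration):

from pyspark import SparkConf, SparkContext

# Spark 1.x only: "hash" keeps one in-memory buffer per reduce partition,
# while "sort" needs only a single buffer at any given point
conf = SparkConf().setAppName("shuffle-demo").set("spark.shuffle.manager", "sort")
sc = SparkContext(conf=conf)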
MapReduce takes the list of outputs coming from the map function and performs these two sub-steps on each and every key-value pair:
The merging step combines all key-value pairs that have the same key (that is, it groups key-value pairs by comparing keys) and returns <Key, List<Value>> pairs. The sorting step takes the output of the merging step and sorts all key-value pairs by key; it also returns <Key, List<Value>> output, but with the pairs ordered by key.
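A minimal Python sketch of those two sub-steps (the function name is illustrative):

from collections import defaultdict

def merge_and_sort(pairs):
    # Merging step: group values by key -> <Key, List<Value>>
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting step: the same <Key, List<Value>> pairs, ordered by key
    return sorted(groups.items())

print(merge_and_sort([("b", 2), ("a", 1), ("b", 3), ("a", 4)]))
# [('a', [1, 4]), ('b', [2, 3])]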
First, the execution time of the map stage increases with the number of reduce tasks, for both Spark-Hash and Spark-Sort. This is because of the increased overhead of handling open files and of the commit operation for disk writes.
In contrast, the number of reduce tasks has little effect on the execution time of the map stage for MapReduce, and it has no effect on the execution time of Spark's reduce stage.
For MapReduce, however, the execution time of the reduce stage increases as more reduce tasks are used, because less map output can be copied in parallel with the map stage. For both MapReduce and Spark, increasing the buffer size reduces disk spills but does not reduce the execution time, since disk I/O is not the bottleneck. A larger buffer can even slow Spark down, due to the increased overhead of GC and of page swapping in the OS buffer cache.
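If you want to experiment with the number of reduce tasks yourself, the DataFrame API exposes it as the spark.sql.shuffle.partitions setting. A small sketch (adaptive execution is disabled here only so the partition count stays fixed; the values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "false")  # keep the count fixed
spark.conf.set("spark.sql.shuffle.partitions", "400")  # number of reduce tasks

df = spark.createDataFrame([(x,) for x in range(1000)], ["i"])
print(df.sort("i").rdd.getNumPartitions())  # typically reflects the setting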
To summarize: Spark's map stage benefits from in-memory processing, from reading input blocks cached in the OS buffer, and from the low memory overhead of sort-based shuffle, while MapReduce's reduce stage is faster because the shuffle is overlapped with the map stage, hiding the network overhead.
Upvotes: 4