Ged

Reputation: 18003

Under-the-hood Spark DataFrame optimization

Leaving aside the database connection aspects that get discussed with mapPartitions for RDDs, and noting that for me the DataFrame under the hood is harder to follow than the RDD abstraction: do DataFrames need to be converted back to RDDs under the hood to achieve performance and optimization?

Upvotes: 1

Views: 719

Answers (1)

pushpavanthar

Reputation: 869

From Spark 2.0 onwards, a DataFrame is a Dataset organized into named columns. To answer your question: there is no need to convert DataFrames back to RDDs to achieve performance and optimization, because Datasets and DataFrames are already more efficient than raw RDDs, for the reasons below.
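
As a minimal illustration (Scala; the data and column names are made up for this sketch), a DataFrame in Spark 2.x is literally the type alias Dataset[Row], so staying in the DataFrame/Dataset API never involves a conversion to RDDs:

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    val spark = SparkSession.builder()
      .appName("dataframe-is-dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // In Spark 2.x, DataFrame is defined as: type DataFrame = Dataset[Row]
    val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val ds: Dataset[Row] = df // same type, no conversion happens

    // This all stays inside the optimized Dataset API; nothing here
    // drops down to the RDD abstraction.
    ds.groupBy("key").count().show()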

  1. They are built on top of the Spark SQL engine, which uses the Catalyst optimizer. Catalyst leverages advanced programming-language features (e.g. Scala's pattern matching and quasiquotes) to generate optimized logical and physical query plans. While the typed Dataset[T] API is optimized for data-engineering tasks, the untyped Dataset[Row] (an alias for DataFrame) is even faster and well suited to interactive analysis (see the Catalyst sketch after this list).
  2. Spark acts as a compiler: it understands your Dataset's JVM object type and maps type-specific JVM objects to Tungsten's internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize and deserialize JVM objects and generate compact bytecode that executes at superior speeds (see the encoder sketch after this list).
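
A quick way to see Catalyst at work is Dataset.explain(true), which prints the parsed, analyzed, and optimized logical plans plus the physical plan. A minimal sketch (toy data; the column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("catalyst-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data for this example.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    // Catalyst rewrites the logical plan before execution: the two
    // filters below are collapsed into a single predicate and pushed
    // as close to the data source as possible.
    val query = df.filter($"id" > 1).filter($"label" =!= "c").select("id")

    // Prints the parsed, analyzed, and optimized logical plans plus
    // the physical plan, making the rewrites visible.
    query.explain(true)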
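
And a minimal encoder sketch (the Person case class is hypothetical): defining a Dataset[T] over a case class makes Spark derive an encoder that lays the objects out in Tungsten's compact binary format rather than as generic JVM objects:

    import org.apache.spark.sql.{Encoders, SparkSession}

    // Hypothetical case class for illustration.
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder()
      .appName("encoder-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The implicit product encoder maps Person into Tungsten's binary
    // row format; fields can be read without deserializing the whole
    // object back into a JVM instance.
    val people = Seq(Person("ann", 34), Person("bob", 29)).toDS()

    // The schema is derived from the case class via the encoder, not
    // by runtime reflection over generic objects.
    Encoders.product[Person].schema.printTreeString()

    // Typed transformations still run through Catalyst and Tungsten.
    people.filter(_.age > 30).show()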

Upvotes: 1
