Ged

Reputation: 18003

Under-the-hood Spark DataFrame optimization

Leaving aside the database connection aspects that get discussed with mapPartitions for RDDs, and noting that for me the DataFrame under the hood is harder to follow than the RDD abstraction: do DataFrames need to be converted back to RDDs under the hood to achieve performance and optimization?

Upvotes: 1

Views: 719

Answers (1)

pushpavanthar

Reputation: 869

From Spark 2.0 onwards, a DataFrame is a Dataset organized into named columns. To answer your question: there is no need to convert DataFrames back to RDDs to achieve performance and optimization, because Datasets and DataFrames are already more efficient than raw RDDs, for the reasons below.
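
As a minimal illustration (Scala; the data and column names are made up for this sketch), a DataFrame in Spark 2.x is literally the type alias Dataset[Row], so staying in the DataFrame/Dataset API never involves a conversion to RDDs:

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    val spark = SparkSession.builder()
      .appName("dataframe-is-dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // In Spark 2.x, DataFrame is defined as: type DataFrame = Dataset[Row]
    val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val ds: Dataset[Row] = df // same type, no conversion happens

    // This all stays inside the optimized Dataset API; nothing here
    // drops down to the RDD abstraction.
    ds.groupBy("key").count().show()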

  1. They are built on top of the Spark SQL engine, which uses the Catalyst optimizer. Catalyst leverages advanced programming-language features (e.g. Scala's pattern matching and quasiquotes) to generate optimized logical and physical query plans. While the typed Dataset[T] API is optimized for data-engineering tasks, the untyped Dataset[Row] (an alias for DataFrame) is even faster and well suited to interactive analysis (see the Catalyst sketch after this list).
  2. Spark acts as a compiler: it understands your Dataset's JVM object type and maps type-specific JVM objects to Tungsten's internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize and deserialize JVM objects and generate compact bytecode that executes at superior speeds (see the encoder sketch after this list).
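
A quick way to see Catalyst at work is Dataset.explain(true), which prints the parsed, analyzed, and optimized logical plans plus the physical plan. A minimal sketch (toy data; the column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("catalyst-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data for this example.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    // Catalyst rewrites the logical plan before execution: the two
    // filters below are collapsed into a single predicate and pushed
    // as close to the data source as possible.
    val query = df.filter($"id" > 1).filter($"label" =!= "c").select("id")

    // Prints the parsed, analyzed, and optimized logical plans plus
    // the physical plan, making the rewrites visible.
    query.explain(true)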
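
And a minimal encoder sketch (the Person case class is hypothetical): defining a Dataset[T] over a case class makes Spark derive an encoder that lays the objects out in Tungsten's compact binary format rather than as generic JVM objects:

    import org.apache.spark.sql.{Encoders, SparkSession}

    // Hypothetical case class for illustration.
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder()
      .appName("encoder-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The implicit product encoder maps Person into Tungsten's binary
    // row format; fields can be read without deserializing the whole
    // object back into a JVM instance.
    val people = Seq(Person("ann", 34), Person("bob", 29)).toDS()

    // The schema is derived from the case class via the encoder, not
    // by runtime reflection over generic objects.
    Encoders.product[Person].schema.printTreeString()

    // Typed transformations still run through Catalyst and Tungsten.
    people.filter(_.age > 30).show()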

Upvotes: 1
