Reputation: 1781
While learning Spark SQL, a question came to my mind:
As stated, the result of SQL execution is a SchemaRDD, but what happens behind the scenes? How many transformations or actions are invoked in the optimized execution plan, and are they equivalent to hand-written plain RDD code?
If we write the code by hand instead of using SQL, it may generate some intermediate RDDs, e.g. a series of map() and filter() operations on the source RDD. But the SQL version would not generate intermediate RDDs, correct?
Depending on the SQL content, the generated JVM bytecode also involves partitioning and shuffling, correct? But without intermediate RDDs, how could Spark schedule and execute them on the worker machines?
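For instance, a hand-written version might look roughly like the sketch below (the input file, column positions, and table registration are made up for illustration), while the SQL version is a single statement:

// Hand-written version: each transformation yields an intermediate RDD
val lines  = sc.textFile("people.txt")                  // hypothetical input file
val rows   = lines.map(line => line.split(","))         // intermediate RDD
val adults = rows.filter(cols => cols(1).toInt >= 18)   // another intermediate RDD
val names  = adults.map(cols => cols(0))
names.collect()

// SQL version: one statement, no intermediate RDDs in my own code
// (assumes "people" has already been registered as a table)
val result = sqlContext.sql("SELECT name FROM people WHERE age >= 18")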
In fact, I still cannot understand the relationship between Spark SQL and Spark Core. How do they interact with each other?
Upvotes: 2
Views: 403
Reputation: 13528
To understand how Spark SQL or the DataFrame/Dataset DSL maps to RDD operations, look at the physical plan Spark generates using explain.
sql(/* your SQL here */).explain
myDataframe.explain
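For example, a minimal sketch assuming Spark 2.x running in local mode (the data and query are made up):

import org.apache.spark.sql.SparkSession

// Assumes a local Spark installation
val spark = SparkSession.builder()
  .appName("explain-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
df.createOrReplaceTempView("people")

// Physical plan for a SQL query
spark.sql("SELECT name FROM people WHERE id > 1").explain()

// Physical plan for the equivalent DataFrame DSL query
df.filter($"id" > 1).select("name").explain()

Both print the physical plan, which describes the RDD-level work Spark will actually run on the cluster.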
At the very core of Spark, RDD[_] is the underlying data type that is manipulated using distributed operations. In Spark versions <= 1.6.x, DataFrame is RDD[Row] and Dataset is separate. In Spark versions >= 2.x, DataFrame becomes Dataset[Row]. That doesn't change the fact that underneath it all Spark uses RDD operations.
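As a quick illustration (a sketch, assuming Spark 2.x and the df from the snippet above), the RDD underneath a DataFrame is directly accessible:

val rowRdd = df.rdd              // RDD[Row]: the RDD view of the DataFrame
println(rowRdd.toDebugString)    // the RDD lineage that Spark Core schedules and executes
println(df.queryExecution)       // Catalyst's parsed, analyzed, optimized, and physical plans

So Spark SQL plans the query with Catalyst, and the resulting physical plan is handed to Spark Core, which runs it as ordinary RDD computations across the worker machines.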
For a deeper dive into understanding Spark execution, read Understanding Spark Through Visualization.
Upvotes: 3