Reputation: 1275
I am trying to understand whether there is a relationship between RDDs and DataFrames/Datasets from a technical point of view. RDDs are often described as the fundamental data abstraction in Spark. In my understanding, this would mean that DataFrames/Datasets should also be based on them. In the original Spark SQL paper, figures 1 and 3 point to this connection. However, I haven't found any documentation on what this connection looks like (if it exists at all).
So my question: Are DataFrames/Datasets based on RDDs, or are these two concepts independent?
Upvotes: 0
Views: 348
Reputation: 2108
DataFrames and Datasets are based on RDDs, although this is somewhat hidden. DataFrames and Datasets live in the spark-sql project, whereas RDDs live in spark-core.
Here is how a DataFrame, which is a Dataset[Row], and an RDD are linked from a technical point of view: a DataFrame has a QueryExecution, which controls how the SQL execution proceeds. When the query is executed by the engine, the result is produced as an internal RDD of InternalRow: `lazy val toRdd: RDD[InternalRow] = executedPlan.execute()`. That RDD together with a schema forms a DataFrame.
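To make the link concrete, here is a minimal sketch of how you could inspect it yourself. It assumes a local SparkSession; the object name `DataFrameRddLink`, the helper `roundTrip`, and the sample data are hypothetical and only there for illustration. It only uses `queryExecution`, `toRdd`, `rdd`, and `createDataFrame`, which exist on the public API, although `queryExecution` is a developer-facing API and its details may change between versions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

object DataFrameRddLink {

  // Hypothetical helper: goes DataFrame -> RDD -> DataFrame and returns the rebuilt DataFrame.
  def roundTrip(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data, made up for this sketch.
    val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "label")

    // The QueryExecution holds the logical, optimized, and physical plans.
    val qe = df.queryExecution

    // Internal view: the physical plan executed as an RDD of InternalRow.
    val internal: RDD[InternalRow] = qe.toRdd
    println(s"internal RDD partitions: ${internal.getNumPartitions}")

    // Public view: the same data exposed as an RDD of Row.
    val rows: RDD[Row] = df.rdd

    // An RDD[Row] plus a schema gives a DataFrame again.
    spark.createDataFrame(rows, df.schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-rdd-link")
      .getOrCreate()

    roundTrip(spark).show()
    spark.stop()
  }
}
```

Printing `df.queryExecution.executedPlan` in the same session shows the physical plan that `toRdd` executes, which is a convenient way to see what actually runs on top of spark-core.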
Upvotes: 2