Reputation: 42050
This is a follow-up to my previous question.
Row is an ordered set of key-value pairs. DataFrame is a collection of Rows.
What kind of data structure is a DataFrame, actually? Is it a list, a set, or some other "collection"? Is it a relation as in SQL?
Upvotes: 0
Views: 655
Reputation: 6974
Although a DataFrame is an abstraction over an RDD, its internal representation is quite different from an RDD's.
An RDD is represented as Java objects and relies on the JVM for all operations. A DataFrame, however, is represented in Tungsten's binary format.
Here is an excellent article that elaborates on how DataFrames are represented in Tungsten.
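To make the contrast concrete, here is a toy sketch in plain Python. It is not Tungsten's actual layout: it only shows the same rows held once as ordinary objects and once as fixed-width binary records, using the standard `struct` module. The schema, the names `ROW_FORMAT`, `pack_rows`, and `read_field`, and the record format are all assumptions made up for illustration.

```python
import struct

# Hypothetical schema for illustration: (id: 32-bit int, score: 64-bit float).
ROW_FORMAT = "<id"                      # little-endian int32 followed by float64
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # bytes per record

# Object representation: one tuple object per row.
rows_as_objects = [(1, 0.5), (2, 1.5), (3, 2.5)]


def pack_rows(rows):
    """Encode every row into one contiguous bytes buffer."""
    return b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)


def read_field(buffer, row_index, field_index):
    """Decode the record at a computed offset and return one of its fields."""
    offset = row_index * ROW_SIZE
    return struct.unpack_from(ROW_FORMAT, buffer, offset)[field_index]


packed = pack_rows(rows_as_objects)
second_score = read_field(packed, 1, 1)
```

Because every record has the same width, a record can be located by offset arithmetic instead of by following a reference to a per-row object, which is the general idea behind a binary row format.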
Upvotes: 1
Reputation: 191728
It's an abstraction over an RDD[Row], or a Dataset[Row] in Spark 2, with a defined schema set through a series of Column classes.
Is it a list, set, or other "collection"?
Not in the Java sense of those words. Like an RDD, it is none of those, but rather a "lazy collection".
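What "lazy collection" means can be sketched with a toy model in plain Python. This is not Spark code and not how Spark is implemented; `LazyTable`, `where`, `collect`, and `load` are hypothetical names used only to show the idea that transformations are recorded and the data is read only when a result is requested.

```python
class LazyTable:
    """Toy stand-in for a schema-carrying, lazily evaluated dataset."""

    def __init__(self, schema, source, steps=()):
        self.schema = schema    # ordered column names
        self._source = source   # zero-argument callable returning tuples
        self._steps = steps     # recorded predicates, not yet applied

    def where(self, predicate):
        # Record the step and return a new table; no rows are touched here.
        return LazyTable(self.schema, self._source, self._steps + (predicate,))

    def collect(self):
        # Only now is the source read and every recorded step applied.
        rows = (dict(zip(self.schema, values)) for values in self._source())
        for predicate in self._steps:
            rows = filter(predicate, rows)
        return list(rows)


reads = []  # records each time the underlying data is actually read


def load():
    reads.append("read")
    return [(1, "a"), (2, "b"), (3, "c")]


table = LazyTable(("id", "name"), load)
big = table.where(lambda row: row["id"] > 1)  # nothing is read yet
reads_before_collect = len(reads)
result = big.collect()                        # the read happens here
```

Each row comes back as an ordered mapping from column name to value, and the source is not read until `collect` is called, so the table behaves like a description of a computation rather than a list or set held in memory.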
Is it a relation as in SQL?
You're welcome to run Spark SQL over a DataFrame, but it's a table. Relations between tables are optional.
Upvotes: 1