Reputation: 1195
I am new to Spark and learning about DataFrames, their operations, and the architecture. While reading about the comparison between RDD and DataFrame, I got confused about the data structure of both. Below are my observations; please help clarify or correct them if they are wrong.
1) An RDD is stored in the computers' RAM in a distributed manner (blocks) across the nodes in a cluster, if the source data is in a cluster (e.g. HDFS).
If the data source is just a single CSV file, the data will be distributed to multiple blocks in the RAM of the running machine (if a laptop). Am I right?
2) Is there any relationship between a block and a partition? Which one is the superset?
3) DataFrame: Is the DataFrame also stored in the same way as an RDD? Will an RDD be created in the background if I store my source data in a DataFrame alone?
Thanks in advance :)
Upvotes: 2
Views: 457
Reputation: 1704
An RDD is stored in the computers' RAM in a distributed manner (blocks) across the nodes in a cluster, if the source data is in a cluster (e.g. HDFS).
If caching or checkpointing is enabled, it also might be stored either in memory or on disk. Also, shuffling always involves a disk write.
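For illustration, a minimal sketch of both mechanisms in spark-shell (where `spark` and `sc` are already defined; the paths are placeholders): `persist` lets you choose a storage level, while `checkpoint` writes the RDD to reliable storage and truncates its lineage.

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder input path, not from the original question.
val rdd = sc.textFile("hdfs:///data/input.txt")

// cache() keeps partitions in memory only (MEMORY_ONLY);
// persist() lets you choose memory, disk, or both.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// checkpoint() writes the RDD to the checkpoint directory and cuts the lineage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
rdd.checkpoint()

rdd.count()  // an action materializes the cache and the checkpoint
```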
If the data source is just a single CSV file, the data will be distributed to multiple blocks in the RAM of the running machine (if a laptop). Am I right?
The CSV file will be split into multiple partitions, and each task will only read a chunk of the data (start-end offsets).
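You can see this for yourself in spark-shell; the file name below is hypothetical, the point is that even a single local CSV ends up as several partitions, each read by one task:

```scala
// DataFrame API: check how many partitions the file was split into.
val df = spark.read.option("header", "true").csv("people.csv")
println(df.rdd.getNumPartitions)

// RDD API: you can also ask for a minimum number of splits explicitly.
val lines = sc.textFile("people.csv", minPartitions = 4)
println(lines.getNumPartitions)
```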
Is there any relationship between a block and a partition? Which one is the superset?
It is a bit confusing. Take a look at this answer, which states that a split is a logical division of the input data, while a block is a physical division of the data. Spark uses its own terminology, and a partition in Spark has roughly the same meaning as a split in Hadoop. When a file is read from HDFS, a HadoopRDD is used and, under the hood, each split becomes a partition.
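A small sketch of the consequence (hypothetical HDFS path): the partition count of a HadoopRDD-backed RDD follows the input splits, which by default line up with the file's HDFS blocks, but since splits are logical you can request more partitions than there are blocks:

```scala
val rdd = sc.textFile("hdfs:///data/big_file.txt")
println(rdd.getNumPartitions)          // by default roughly one partition per HDFS block

val moreSplits = sc.textFile("hdfs:///data/big_file.txt", 100)
println(moreSplits.getNumPartitions)   // splits are logical, so this can exceed the block count
```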
DataFrame: Is the DataFrame also stored in the same way as an RDD? Will an RDD be created in the background if I store my source data in a DataFrame alone?
A DataFrame is nothing other than an RDD[InternalRow] under the hood.
Take a look at SparkPlan.
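A quick sketch in spark-shell showing how a DataFrame exposes its underlying RDDs and physical plan (the DataFrame itself is just an example):

```scala
val df = spark.range(10).toDF("id")

// df.rdd gives you an RDD[Row] of deserialized objects...
val rowRdd = df.rdd

// ...while the physical plan operates on RDD[InternalRow]:
val internalRdd = df.queryExecution.toRdd   // RDD[InternalRow]

// The SparkPlan mentioned above:
println(df.queryExecution.executedPlan)
```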
Upvotes: 3