NikRED

Reputation: 1195

Spark RDD vs Dataframe - Data storage

I am new to Spark and am learning about the Dataframe, its operations, and its architecture. While reading about the comparison between RDD and Dataframe, I got confused about the data structure of both. Below are my observations. Please help to clarify or correct them if they are wrong.

1) An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is on a cluster (e.g. HDFS).

If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (for example, a laptop). Am I right?

2) Is there any relationship between a block and a partition? Which one is the superset?

3) Dataframe: Is a Dataframe stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data into a Dataframe only?

Thanks in advance :)

Upvotes: 2

Views: 457

Answers (1)

Gelerion

Reputation: 1704

An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is on a cluster (e.g. HDFS).

If caching or checkpointing is enabled, it might also be stored either in memory or on disk. Also, shuffling always involves a disk write.

If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (for example, a laptop). Am I right?

The CSV file will be split into multiple partitions, and each task will read only its own chunk of the data (a range of start-end offsets).
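To make the start-end offsets idea concrete, here is a minimal sketch in plain Python. It is not Spark's actual code, only a simplified model: `split_ranges` and `read_split` are hypothetical helper names, and the rule that a line belongs to the range containing its first byte is an assumption chosen for clarity.

```python
import os
import tempfile

def split_ranges(file_size, num_splits):
    """Return contiguous (start, end) byte ranges covering [0, file_size)."""
    base, extra = divmod(file_size, num_splits)
    ranges, start = [], 0
    for i in range(num_splits):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def read_split(path, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Look at the byte just before `start`. If it is not a newline,
            # we landed mid-line; that line belongs to the previous range.
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n").decode("utf-8"))
    return lines

# Build a small CSV file to stand in for the single source file.
rows = ["id,name"] + [f"{i},user{i}" for i in range(10)]
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(("\n".join(rows) + "\n").encode("utf-8"))
    path = tmp.name

size = os.path.getsize(path)
ranges = split_ranges(size, 3)
# Each "task" reads only its own byte range of the same file.
partitions = [read_split(path, s, e) for s, e in ranges]
os.remove(path)

for (s, e), part in zip(ranges, partitions):
    print(f"bytes [{s}, {e}): {len(part)} lines")
```

Every line ends up in exactly one partition even though the byte ranges cut through the middle of lines, which is why the tasks can work independently on one file.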

Is there any relationship between a block and a partition? Which one is the superset?

The terminology is a bit confusing. Take a look at this answer, which states that a split is a logical division of the input data, while a block is a physical division of the data. Spark uses its own terminology, and a partition in Spark has roughly the same meaning as a split in Hadoop.

When a file is read from HDFS, HadoopRDD is used under the hood, and each split becomes a partition.
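As a rough illustration of the logical versus physical distinction, the sketch below maps byte-range splits onto fixed-size blocks using plain arithmetic. The file size and split sizes are made-up numbers, the 128 MB block size is only an assumed setting, and `byte_ranges` and `overlapping_blocks` are hypothetical helpers rather than any Hadoop or Spark API.

```python
MB = 1024 * 1024

def byte_ranges(total, size):
    """Cut [0, total) into consecutive (start, end) ranges of at most `size` bytes."""
    return [(s, min(s + size, total)) for s in range(0, total, size)]

def overlapping_blocks(span, blocks):
    """Indices of the blocks that share at least one byte with `span`."""
    start, end = span
    return [i for i, (b_start, b_end) in enumerate(blocks)
            if b_start < end and start < b_end]

file_size = 300 * MB                        # assumed file size
blocks = byte_ranges(file_size, 128 * MB)   # physical layout: 3 blocks

# Logical splits smaller than a block: each split sits inside one block.
small_splits = {i: overlapping_blocks(s, blocks)
                for i, s in enumerate(byte_ranges(file_size, 64 * MB))}

# Logical splits that do not line up with blocks: some span two blocks.
large_splits = {i: overlapping_blocks(s, blocks)
                for i, s in enumerate(byte_ranges(file_size, 100 * MB))}

print(small_splits)  # {0: [0], 1: [0], 2: [1], 3: [1], 4: [2]}
print(large_splits)  # {0: [0], 1: [0, 1], 2: [1, 2]}
```

In this simplified model neither one is strictly a superset of the other: a split can cover part of a block or reach across a block boundary, depending on how the split size is configured.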

Dataframe: Is a Dataframe stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data into a Dataframe only?

A Dataframe is nothing more than an RDD[InternalRow] under the hood.
Take a look at SparkPlan.

Upvotes: 3
