Reputation: 63
Please help me to understand the difference between HDFS' data blocks and RDDs in Spark. HDFS distributes a dataset across multiple nodes in a cluster as fixed-size blocks, and each block is replicated multiple times and stored. RDDs are created as parallelized collections. Are the elements of a parallelized collection distributed across nodes, or are they stored in memory for processing? Is there any relation to HDFS' data blocks?
Upvotes: 2
Views: 4713
Reputation: 71
Is there any relation to HDFS' data blocks?
In general, no. They address different issues.
Distribution is a common denominator, but that is about it, and the failure-handling strategies are obviously different (DAG recomputation and replication, respectively).
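To illustrate the RDD side, here is a minimal Scala sketch (the local master and partition count are assumptions, chosen only for demonstration). It shows that a parallelized collection is split into Spark-managed partitions that can be processed on different nodes, with no HDFS blocks involved at all:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelizeDemo {
      def main(args: Array[String]): Unit = {
        // Local master for illustration only; on a real cluster the
        // partitions would be distributed across executor nodes.
        val conf = new SparkConf().setAppName("ParallelizeDemo").setMaster("local[4]")
        val sc = new SparkContext(conf)

        // A parallelized collection is split into partitions (4 here),
        // each of which can be scheduled on a different node.
        val rdd = sc.parallelize(1 to 100, numSlices = 4)
        println(s"Number of partitions: ${rdd.getNumPartitions}") // -> 4

        sc.stop()
      }
    }

Note that the elements only materialize in executor memory when an action runs; losing a partition triggers recomputation from the lineage (the DAG), not retrieval of a replica.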
Spark can use Hadoop Input Formats and read data from HDFS. In that case there will be a relationship between HDFS blocks and Spark splits. However, Spark doesn't require HDFS, and many components of the newer APIs don't use Hadoop Input Formats anymore.
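For example, reading a file from HDFS through the classic RDD API typically yields one Spark partition per HDFS block, since the underlying input splits follow block boundaries by default (a sketch; the path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsSplitsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HdfsSplitsDemo"))

        // textFile uses Hadoop's TextInputFormat under the hood, so input
        // splits (and hence partitions) follow HDFS block boundaries by default.
        // Hypothetical path: replace with a real file on your cluster.
        val lines = sc.textFile("hdfs:///data/large-file.txt")

        // For a file stored as N HDFS blocks this typically prints N
        // (a larger minPartitions argument can raise it further).
        println(s"Partitions: ${lines.getNumPartitions}")

        sc.stop()
      }
    }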
Upvotes: 7