Krishna Reddy

Reputation: 1099

Spark - Is a Spark RDD a logical collection of instructions?

While walking through the Apache Spark architecture guide for beginners in this tutorial, I came across a couple of questions about RDD processing in Spark:

  1. In my understanding, an RDD is a logical collection of instructions that will be executed on a physical dataset (lazy execution). Is my understanding correct, or is an RDD a physical dataset in memory?
  2. Suppose a 20 GB file is stored in HDFS and is processed by a Spark application. The file is distributed across the Hadoop cluster for storage. If DataNode A holds 3 blocks with a total size of 192 MB, will these 3 blocks be processed by an executor on DataNode A, or is there some block-to-executor mapping concept?
  3. Is the executor program responsible for loading data from the HDFS blocks?

Any help in understanding the above concepts is highly appreciated. Thanks.

Upvotes: 1

Views: 489

Answers (1)

Harel Gliksman

Reputation: 754

1) Kind of both. An RDD holds a graph of its ancestors (its lineage), each of which is the result of an RDD transformation. It is not evaluated until an action requires it (such as writing to storage or computing some final value). However, an RDD can be persisted at different storage levels, such as memory only or memory and disk. Persisting is lazy as well: the data is stored only when the RDD actually gets evaluated. There is also a difference between the logical level and the actual execution level: narrow transformations can be executed together, which makes them inseparable at the execution level.
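To make the lazy part concrete, here is a toy model in plain Python. It is not Spark's API or implementation; `ToyRDD`, `load`, and the other names are made up for illustration. It only shows the idea that transformations record lineage, an action triggers the computation, and a persisted result is stored on first evaluation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores how to compute data, not the data."""

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute   # function that produces this RDD's elements
        self.parent = parent      # link to the ancestor (the lineage)
        self.name = name
        self._persist = False     # set by persist(); nothing is stored yet
        self._cache = None        # filled on first evaluation if persisted

    def map(self, f):
        # Transformation: only adds a node to the lineage, computes nothing.
        return ToyRDD(lambda: (f(x) for x in self._iterate()), self, "map")

    def filter(self, pred):
        # Transformation: chained generators process one element at a time,
        # loosely mirroring how narrow transformations are run together.
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)), self, "filter")

    def persist(self):
        # Lazy as well: this only marks the RDD, the data is stored later.
        self._persist = True
        return self

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)
        if self._persist:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()

    def collect(self):
        # Action: this is the point where the lineage is actually evaluated.
        return list(self._iterate())

    def lineage(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


reads = []  # records every time the source data is really read


def load():
    reads.append("read")
    return iter(range(1, 6))


source = ToyRDD(load)
evens_squared = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0).persist()

reads_before_action = list(reads)   # still empty: nothing has run yet
result = evens_squared.collect()    # first action: reads the source, fills the cache
again = evens_squared.collect()     # second action: served from the persisted data
```

After both calls to `collect()`, `reads` contains a single entry, because the second action reuses the persisted result instead of walking the lineage again.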

2) There is logic that assigns blocks to executors. Data locality is a major consideration, but if a machine is busy, another machine with free slots may take some of the blocks.
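As a rough sketch of that idea (not Spark's actual scheduler), the function below prefers a node that holds the block and falls back to any node with a free slot. The function name, the node names, and the slot counts are assumptions for illustration. The real scheduler is more involved: for example, it can wait a while for a local slot (see the `spark.locality.wait` setting) before running a task elsewhere.

```python
def assign_blocks(block_locations, free_slots):
    """Assign each block to a node, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding that block
    free_slots:      dict of node -> number of free task slots
    Returns a dict of block id -> chosen node (None if no slot is free).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, nodes in block_locations.items():
        # First choice: a node that already stores the block and has a free slot.
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            chosen = local[0]
        else:
            # Fallback: any node with a free slot, even though the data is remote.
            candidates = [n for n, free in slots.items() if free > 0]
            chosen = candidates[0] if candidates else None
        if chosen is not None:
            slots[chosen] -= 1
        assignment[block] = chosen
    return assignment


# Scenario from the question: node A stores all three blocks,
# but it only has two free slots, so one block is processed on node B.
assignment = assign_blocks(
    {"b1": ["A"], "b2": ["A"], "b3": ["A"]},
    {"A": 2, "B": 2},
)
```

Here `b1` and `b2` stay on node A and `b3` goes to node B, so the answer to "same executor or not" is: usually local, but not guaranteed.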

3) Not sure exactly what you mean, but there is a driver that (depending on your deployment) assigns tasks to executors and monitors their execution. Once a task is assigned to an executor, that executor reads the data it needs.
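The division of labour can be pictured with another small plain-Python toy (again, hypothetical names, not Spark classes): the driver hands out only a block id and a function, and the executor is the one that reads the block. `STORAGE` stands in for HDFS, and the round-robin assignment is a simplification.

```python
# Stand-in for blocks stored in HDFS: block id -> lines in that block.
STORAGE = {
    "b1": ["line 1", "line 2"],
    "b2": ["line 3"],
    "b3": ["line 4", "line 5", "line 6"],
}


class Executor:
    def __init__(self, name):
        self.name = name
        self.blocks_read = []  # which blocks this executor loaded itself

    def run_task(self, block_id, func):
        data = STORAGE[block_id]          # the executor reads the block
        self.blocks_read.append(block_id)
        return func(data)                 # only the small result goes back


class Driver:
    def __init__(self, executors):
        self.executors = executors

    def run_job(self, block_ids, func):
        # The driver never touches the data: it sends (block id, function)
        # to an executor and gathers the per-task results.
        results = []
        for i, block_id in enumerate(block_ids):
            executor = self.executors[i % len(self.executors)]  # round-robin
            results.append(executor.run_task(block_id, func))
        return results


exec_a, exec_b = Executor("A"), Executor("B")
driver = Driver([exec_a, exec_b])
line_counts = driver.run_job(["b1", "b2", "b3"], len)  # lines per block
total_lines = sum(line_counts)                          # final value on the driver
```

So, in this picture, the answer to the third question is yes: the task running inside the executor loads the block, and the driver only coordinates.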

Upvotes: 0
