RagHaven

Reputation: 4337

In-Memory data for RDD

I have been using Spark and I am curious about how exactly RDDs work. I understand that an RDD is a pointer to the data. If I create an RDD for an HDFS file, I understand that the RDD will be a pointer to the actual data in the HDFS file.

What I do not understand is where the data gets stored in memory. When a task is sent to a worker node, does the data for a specific partition get stored in memory on that worker node? If so, what happens when an RDD partition is stored in memory on worker node1, but worker node2 has to compute a task on the same partition of the RDD? Does worker node2 communicate with worker node1 to fetch the data for the partition and store it in its own memory?

Upvotes: 0

Views: 296

Answers (1)

Daniel Langdon

Reputation: 5999

In principle, tasks are divided among executors, each one processing its own separate chunk of data (for instance, from HDFS files or folders). The data for a task is loaded into local memory on that executor. Multiple transformations can be chained within the same task.

If, however, a transformation needs to pull data from more than one executor, a new set of tasks is created, and the results of the previous tasks are shuffled and redistributed across executors. For instance, many of the *byKey transformations shuffle the entire dataset across the network so that executors can perform the second set of tasks. The number of shuffles and the volume of shuffled data are critical to Spark's performance.

Upvotes: 1
