Manikanta Komuravelli

Reputation: 904

How is data accessed by worker nodes in a Spark Cluster?

I'm trying to understand how Spark works. I know the cluster manager allocates the resources (workers) for the driver program. I want to know how the cluster manager sends tasks (which transformations) to the worker nodes, and how the worker nodes access the data (assume my data is in S3)?

Does each worker node read only a part of the data, apply all the transformations to it, and return the result of the action to the driver program? Or does each worker node read the entire file, apply only specific transformations, and return the result to the driver program?

Follow-up questions: How, and who, decides how much data needs to be sent to the worker nodes, given we have established that only partial data is present on each worker node? E.g.: I have two worker nodes with 4 cores each, and one 1 TB CSV file to read and perform a few transformations and an action on. Assume the CSV is on S3 and also on the master node's local storage.

Upvotes: 1

Views: 2632

Answers (1)

pltc

Reputation: 6082

It's going to be a long answer, but I will try to simplify it as best I can:

Typically a Spark cluster contains multiple nodes; each node has multiple CPUs, a bunch of memory, and storage. Each node holds some chunks of the data, which is why they are sometimes also referred to as data nodes.

When a Spark application is started, it creates multiple workers or executors. Those workers/executors take resources (CPU, RAM) from the cluster's nodes described above. In other words, the nodes in a Spark cluster play both roles: data storage and computation.
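As a minimal sketch of what that resource request looks like (the executor counts and sizes here are hypothetical, not from your cluster), an application can ask the cluster manager for executors like this:

```python
from pyspark.sql import SparkSession

# Hypothetical resource request: the cluster manager carves these
# executors out of the cluster's nodes when the application starts.
spark = (
    SparkSession.builder
    .appName("resource-allocation-sketch")
    .config("spark.executor.instances", "2")   # two executors
    .config("spark.executor.cores", "4")       # 4 CPU cores each
    .config("spark.executor.memory", "8g")     # 8 GB RAM each
    .getOrCreate()
)
```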

But as you might have guessed, the data on a node is (sometimes) incomplete, so workers may have to "pull" data across the network to do a partial computation. The partial results are then sent back to the driver, which just does the "collection work" and combines them to get the final result.
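A rough sketch of that flow, assuming a placeholder S3 path: each task counts the rows of its own partition on an executor, and only the per-partition counts travel back to the driver, which adds them up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-computation-sketch").getOrCreate()

# Placeholder path; each executor reads only the file splits assigned to it.
df = spark.read.csv("s3a://my-bucket/data.csv", header=True)

# Each task counts the rows of its own partition on the executor;
# the driver receives the per-partition counts and sums them.
total_rows = df.count()
print(total_rows)
```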

Edit #1:

How, and who, decides how much data needs to be sent to the worker nodes

Each task decides which data is "local" and which is not. This article explains pretty well how data locality works.

I have two worker nodes with 4 cores each and one 1 TB CSV file to read and perform a few transformations and an action

This situation is different from the question above: here you have only one file, and most likely your worker is exactly the same node as your data node. The executor(s) sitting on that worker, however, would read the file piece by piece (one piece per task), in parallel, to increase parallelism.
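As an illustration (the bucket name and the size setting are hypothetical), Spark splits the 1 TB CSV into input partitions whose maximum size is governed by spark.sql.files.maxPartitionBytes, and each partition becomes one task; with 2 workers x 4 cores, up to 8 of those tasks run at the same time.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-split-sketch")
    # Hypothetical setting: cap each input partition at 128 MB,
    # so a 1 TB file is split into roughly 8000 read tasks.
    .config("spark.sql.files.maxPartitionBytes", "128m")
    .getOrCreate()
)

# Placeholder S3 path for the 1 TB CSV.
df = spark.read.csv("s3a://my-bucket/huge.csv", header=True)

# Each partition is read by one task; with 2 workers x 4 cores,
# at most 8 partitions are being processed at any moment.
print(df.rdd.getNumPartitions())
```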

Upvotes: 1
