Reputation: 57
I'm wondering how data is loaded into Spark in the scenario below:
There is 10 GB of transaction data stored in S3 in Parquet format. I'm going to run a Spark program to categorize every record in that 10 GB Parquet file (e.g. Income, Shopping, Dining).
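Roughly, the job would look something like this sketch (the bucket path, the `description` column, and the matching rules are just placeholders, not my real schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("categorize-transactions").getOrCreate()

# Lazily declare the read of the Parquet data; nothing is fetched from S3 yet.
txns = spark.read.parquet("s3a://my-bucket/transactions/")

# Derive a category for each record from a (placeholder) description column.
categorized = txns.withColumn(
    "category",
    F.when(F.col("description").rlike("(?i)salary|payroll"), "Income")
     .when(F.col("description").rlike("(?i)restaurant|cafe"), "Dining")
     .otherwise("Shopping"),
)

# Writing out the result is the action that actually runs the job.
categorized.write.parquet("s3a://my-bucket/transactions-categorized/")
```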
I have the following questions:
Upvotes: 1
Views: 1079
Reputation: 13480
Each worker will issue one or more GET requests for the byte ranges of the Parquet file it has been given, and more as it seeks around the file. The whole 10 GB file is never loaded anywhere.
Each worker does its own read of its own split; this counts against the overall IO capacity of the store/shard.
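For example (the bucket path is just a placeholder), you can check how many splits the read is divided into; each task issues its own ranged GET requests for only its split:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-read-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/transactions/")

# Number of partitions/splits the 10 GB of Parquet data is divided into.
# Each task reads only the byte ranges belonging to its own split.
print(df.rdd.getNumPartitions())
```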
Upvotes: 1
Reputation: 3008
Answer: Spark follows a master-slave architecture. There is one master (the driver/coordinator) and multiple distributed worker nodes. The driver process runs on the master node, and the main method of the program runs in the driver process. The driver process creates the SparkSession or SparkContext, and converts the user code into tasks based on the transformation and action operations in the code, using the lineage graph. The driver builds the logical and physical plans, and once the physical plan is ready it coordinates with the cluster manager to get executors to complete the tasks. The driver only keeps track of the state of the data (metadata) for each of the executors.
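As a small sketch of that (the path and column name are placeholders, not from your data), you can look at the plans the driver builds without running anything on the executors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/transactions/")

# explain() only prints the logical and physical plans the driver has built;
# no tasks are sent to the executors and no data is read from S3.
df.filter(df["amount"] > 100).explain(True)
```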
So, the 10 GB file does not get loaded onto the master node. S3 is distributed storage, and Spark reads from it split by split. The driver process just decides how the data gets split and what each executor needs to work on. Even if you cache the data, it gets cached only on the executor nodes, based on the partitions/data each executor is working on. Also, nothing gets triggered until you call an action operation such as count or collect; Spark builds a lineage graph and a DAG to keep track of this information.
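A small sketch of that laziness (path and column name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Neither of these lines reads any data from S3; they only extend the lineage/DAG.
df = spark.read.parquet("s3a://my-bucket/transactions/")
big = df.filter(df["amount"] > 100)

# Only this action triggers execution: the driver schedules tasks and each
# executor reads and processes its own splits.
print(big.count())
```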
Answer:
As answered in the first question, anything gets loaded into memory only when an action is performed. "Loaded into memory" does not mean loaded into the driver's memory; depending on the action, data gets loaded into the memory of the driver or of the executors. If you use the collect operation, everything gets loaded into the driver's memory, whereas for another operation like count on a cached DataFrame, the data gets loaded into memory on each of the executor nodes.
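A rough illustration of the difference (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-count").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/transactions/")
df.cache()            # marks the DataFrame to be cached on the executors

n = df.count()        # runs on the executors and materializes the cache there;
                      # only the final number comes back to the driver

rows = df.collect()   # pulls every row back into the driver's memory
                      # (dangerous with 10 GB of data)
```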
Now, if one of the executors crashes while the job is running, the driver has the lineage graph information and the (meta)data the crashed executor held, so it reruns the same lineage on another executor to complete the task. This is what makes Spark resilient and fault tolerant.
Upvotes: 1