Reputation: 57
I'm wondering how data is loaded into Spark in the scenario below:
There is 10 GB of transaction data stored in S3 in Parquet format. I'm going to run a Spark program to categorize every record in that 10 GB Parquet file (e.g. Income, Shopping, Dining).
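Roughly, the job would look something like this sketch (the bucket path, the `description` column, and the matching rules are just placeholders, not my real schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("categorize-transactions").getOrCreate()

# Lazily declare the read of the Parquet data; nothing is fetched from S3 yet.
txns = spark.read.parquet("s3a://my-bucket/transactions/")

# Derive a category for each record from a (placeholder) description column.
categorized = txns.withColumn(
    "category",
    F.when(F.col("description").rlike("(?i)salary|payroll"), "Income")
     .when(F.col("description").rlike("(?i)restaurant|cafe"), "Dining")
     .otherwise("Shopping"),
)

# Writing out the result is the action that actually runs the job.
categorized.write.parquet("s3a://my-bucket/transactions-categorized/")
```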
I have the following questions:
Upvotes: 1
Views: 1079
Reputation: 13480
Each worker will issue one or more GET requests for the byte ranges of the Parquet file it has been given, and more as it seeks around the file. The whole 10 GB file is never loaded anywhere.
Each worker does its own read of its own split; this counts against the overall IO capacity of the store/shard.
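For example (the bucket path is just a placeholder), you can check how many splits the read is divided into; each task issues its own ranged GET requests for only its split:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-read-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/transactions/")

# Number of partitions/splits the 10 GB of Parquet data is divided into.
# Each task reads only the byte ranges belonging to its own split.
print(df.rdd.getNumPartitions())
```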
Upvotes: 1
Reputation: 3008
Answer: Spark follows a master-slave architecture. There is one master (the driver/coordinator) and multiple distributed worker nodes. The driver process runs on the master node, and the main method of the program runs in the driver process. The driver process creates the SparkSession or SparkContext, and converts the user code into tasks based on the transformation and action operations in the code, using the lineage graph. The driver builds the logical and physical plans, and once the physical plan is ready it coordinates with the cluster manager to get executors to complete the tasks. The driver only keeps track of the state of the data (metadata) for each of the executors.
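As a small sketch of that (the path and column name are placeholders, not from your data), you can look at the plans the driver builds without running anything on the executors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/transactions/")

# explain() only prints the logical and physical plans the driver has built;
# no tasks are sent to the executors and no data is read from S3.
df.filter(df["amount"] > 100).explain(True)
```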
So, the 10 GB file does not get loaded onto the master node. S3 is distributed storage, and Spark reads from it split by split. The driver process just decides how the data gets split and what each executor needs to work on. Even if you cache the data, it gets cached only on the executor nodes, based on the partitions/data each executor is working on. Also, nothing gets triggered until you call an action operation such as count or collect; Spark builds a lineage graph and a DAG to keep track of this information.
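A small sketch of that laziness (path and column name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Neither of these lines reads any data from S3; they only extend the lineage/DAG.
df = spark.read.parquet("s3a://my-bucket/transactions/")
big = df.filter(df["amount"] > 100)

# Only this action triggers execution: the driver schedules tasks and each
# executor reads and processes its own splits.
print(big.count())
```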
Answer:
As answered in the first question, anything gets loaded into memory only when an action is performed. "Loaded into memory" does not mean loaded into the driver's memory; depending on the action, data gets loaded into the memory of the driver or of the executors. If you use the collect operation, everything gets loaded into the driver's memory, whereas for another operation like count on a cached DataFrame, the data gets loaded into memory on each of the executor nodes.
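A rough illustration of the difference (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-count").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/transactions/")
df.cache()            # marks the DataFrame to be cached on the executors

n = df.count()        # runs on the executors and materializes the cache there;
                      # only the final number comes back to the driver

rows = df.collect()   # pulls every row back into the driver's memory
                      # (dangerous with 10 GB of data)
```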
Now, if one of the executors crashes while the job is running, the driver has the lineage graph information and the (meta)data the crashed executor held, so it reruns the same lineage on another executor to complete the task. This is what makes Spark resilient and fault tolerant.
Upvotes: 1