How does Spark distribute data to workers?
Do the workers read from the data source, or does the driver read it and send it to the workers? And when a worker needs data that is on another worker, do they communicate directly?
Thanks!
Upvotes: 5
Views: 5100
Reputation: 10677
If you use a distributed input method such as SparkContext.textFile, the workers read directly from your data source. (Likewise, if you explicitly open HDFS files from inside worker task code, those reads also happen on the workers.)
If you instead read the data in manually on your main driver program and then call SparkContext.parallelize, then your driver will indeed be sending the data to your workers.
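To make the two input paths concrete, here is a minimal PySpark-style sketch. The helper names load_on_workers and load_via_driver are made up for illustration, and sc is assumed to be an existing SparkContext that you create yourself; only textFile and parallelize are actual SparkContext methods.

```python
def load_on_workers(sc, path):
    # textFile only records where the data lives; the tasks running on the
    # workers read their own pieces of the file when a job actually executes.
    return sc.textFile(path)


def load_via_driver(sc, path):
    # Here the driver process reads the whole file into its own memory first...
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # ...and parallelize then ships that in-memory list out to the workers.
    return sc.parallelize(lines)


# Hypothetical usage (assumes pyspark is installed and the path exists):
#   from pyspark import SparkContext
#   sc = SparkContext("local[2]", "input-demo")
#   rdd_a = load_on_workers(sc, "data.txt")
#   rdd_b = load_via_driver(sc, "data.txt")
```

The second version only makes sense when the data comfortably fits in the driver's memory, since everything passes through the driver before it reaches a worker.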
Data dependencies from worker to worker are generally referred to as the shuffle. This type of worker-to-worker communication is in many ways the heart of most big data processing systems, precisely because it's tricky to do efficiently and reliably. Conceptually you can treat it more or less as "communicating directly", but there may be a lot more going on under the hood, depending on how the data dependency is handled.
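If it helps to picture the shuffle, here is a toy simulation in plain Python. It is not Spark's implementation: the function names, the CRC32-based partitioning, and the sample data are all assumptions made for illustration. The idea it shows is that every record is routed to the partition that owns its key, so each worker can then aggregate its keys locally.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    # Deterministic hash so the routing is the same on every run.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions, num_partitions):
    # Route each (key, value) record to the partition that owns its key.
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[target_partition(key, num_partitions)].append((key, value))
    return out


def reduce_by_key(partition):
    # After the shuffle, all values for a key sit in one partition,
    # so a purely local sum is enough.
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Three "workers", each holding part of the data before the shuffle.
workers = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]
shuffled = shuffle(workers, num_partitions=2)
results = [reduce_by_key(p) for p in shuffled]
```

In a real cluster the routing step is where data crosses the network between workers, which is the part that is hard to do efficiently and reliably.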
Upvotes: 9