Reputation: 18003
Another topic I have read little about.
Leaving S3 aside, and not being in a position just now to try this out on bare metal, consider a classic data-locality setup with Spark on Hadoop, without Dynamic Resource Allocation:
What if a large dataset in HDFS is distributed over all N data nodes in the cluster, but the total-executor-cores parameter is set lower than N, and we need to read all the data, which obviously resides on all N relevant data nodes?
I assume Spark has to ignore this parameter for reading from HDFS. Or not?
If it is ignored, does an executor core then need to be allocated on each such data node, and thus get acquired by the overall job, so that this parameter should be interpreted as applying to processing rather than to reading blocks?
Is the data from such a Data Node immediately shuffled to where the Executors were allocated?
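For concreteness, here is a minimal sketch of the kind of setup I mean (the node count, core values, and HDFS path are made up):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: the dataset's blocks are spread over 10 data nodes,
// but the job is capped at 4 cores in total
// ("spark.cores.max" is the config equivalent of --total-executor-cores on standalone).
val spark = SparkSession.builder()
  .appName("locality-question")
  .config("spark.cores.max", "4")       // fewer cores than data nodes
  .config("spark.executor.cores", "1")  // one core per executor, so at most 4 executors
  .getOrCreate()

// Only the 4 allocated executors are available to read blocks
// that physically live on all 10 data nodes.
val df = spark.read.parquet("hdfs:///data/large_dataset")
println(df.count())
```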
Thanks in advance.
Upvotes: 0
Views: 328
Reputation: 26
There seems to be a little bit of confusion here.
Optimal data locality (node local) is something we want to achieve, not something that is guaranteed. All Spark can do is request resources (for example from YARN - see How YARN knows data locality in Apache spark in cluster mode) and hope that it gets resources which satisfy the data locality constraints.
If it doesn't, it will simply fetch the data from remote nodes. However, this is not a shuffle; it is just a plain transfer over the network.
So, to answer your question: Spark will use the resources that have been allocated, doing its best to satisfy the locality constraints. It cannot use nodes that haven't been acquired, so it won't automatically get additional nodes for reads.
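As a rough illustration (the wait values and path are just examples), you can influence how long the scheduler waits for a data-local slot before falling back to a less local one via the spark.locality.wait settings:

```scala
import org.apache.spark.sql.SparkSession

// Example only: tune how long the scheduler waits for a NODE_LOCAL slot
// before falling back to RACK_LOCAL / ANY (i.e. a remote read, not a shuffle).
val spark = SparkSession.builder()
  .appName("locality-tuning-example")
  .config("spark.locality.wait", "3s")       // default wait per locality level
  .config("spark.locality.wait.node", "3s")  // wait for NODE_LOCAL specifically
  .config("spark.locality.wait.rack", "1s")  // wait for RACK_LOCAL
  .getOrCreate()

// Tasks that cannot be scheduled node-local within the wait are simply run
// on whatever allocated executor is free; that executor then fetches the
// HDFS blocks over the network.
val df = spark.read.parquet("hdfs:///some/dataset")
df.count()
```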
Upvotes: 1