Reputation: 3599
Assume that there is a Spark job that reads a file named records.txt from HDFS, applies some transformations, and performs one action (writing the processed output back to HDFS). The job is submitted to YARN in cluster mode.
Assume also that records.txt is 128 MB and that one of its replicated HDFS blocks is on NODE 1.
Let's say YARN allocates an executor on NODE 1.
How does YARN allocate an executor on exactly the node where the input data is located?
Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?
How does the Spark application determine data locality? Is it done by the driver, which runs inside the Application Master?
Does YARN know about data locality?
Upvotes: 5
Views: 1487
Reputation: 35219
The fundamental question here is:
Does YARN know about data locality?
YARN "knows" what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.
If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
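As a rough illustration of that fallback, here is a toy model in Python. It is not YARN's scheduler or its API: the `Node` class, the `allocate` function, the slot counts, and the locality labels are all hypothetical names made up for this sketch. It only shows the idea of trying the requested host first, then a host on the same rack, then anything with free capacity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    free_slots: int


def allocate(nodes, preferred_hosts, relax_locality=True):
    """Toy allocator: return (node_name, locality_level), or None if nothing fits."""
    by_name = {n.name: n for n in nodes}

    # 1. Try the hosts the application asked for.
    for host in preferred_hosts:
        node = by_name.get(host)
        if node is not None and node.free_slots > 0:
            node.free_slots -= 1
            return node.name, "node-local"

    # A strict request gives up here instead of falling back.
    if not relax_locality:
        return None

    # 2. Fall back to any node on the same rack as a preferred host.
    preferred_racks = {by_name[h].rack for h in preferred_hosts if h in by_name}
    for node in nodes:
        if node.rack in preferred_racks and node.free_slots > 0:
            node.free_slots -= 1
            return node.name, "rack-local"

    # 3. Fall back to any node with free capacity.
    for node in nodes:
        if node.free_slots > 0:
            node.free_slots -= 1
            return node.name, "off-rack"

    return None


# Hypothetical three-node cluster with one free slot per node.
cluster = [
    Node("node1", "rackA", 1),
    Node("node2", "rackA", 1),
    Node("node3", "rackB", 1),
]

first = allocate(cluster, ["node1"])   # node1 is free
second = allocate(cluster, ["node1"])  # node1 is full, same rack is used
third = allocate(cluster, ["node1"])   # rackA is full, any node is used
fourth = allocate(cluster, ["node1"])  # cluster is full
```

In this model the application only supplies `preferred_hosts`; the allocator never looks at HDFS itself, which mirrors the point that the cluster manager knows topology, not data placement.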
So how does the application "know"?
If the application uses an input source (a file system or other) that supports some form of data locality, it can query the corresponding catalog (the namenode in the case of HDFS) to get the locations of the blocks of data it wants to access.
In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).
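To make the "translated into resource constraints" step more concrete, here is a small Python sketch. It is not Spark's actual code: the `block_hosts` mapping stands in for what a namenode lookup might return, and the block names, host names, and `host_demand` helper are hypothetical. The idea is simply that per-partition preferred hosts can be aggregated into a per-host demand that an application could attach to its container requests.

```python
from collections import Counter

# Hypothetical block report: each input block (one partition per block)
# maps to the hosts that hold a replica of it.
block_hosts = {
    "block0": ["node1", "node2", "node3"],
    "block1": ["node1", "node4", "node5"],
}


def host_demand(preferred):
    """Count how many partitions would like to run on each host."""
    return Counter(host for hosts in preferred.values() for host in hosts)


demand = host_demand(block_hosts)

# Hosts wanted by the most partitions come first; these are the ones an
# application would name in its locality-aware resource requests.
ranked = demand.most_common()
```

Here `node1` holds a replica of both blocks, so it ends up at the top of `ranked`; an executor placed there could read either partition locally.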
Upvotes: 7