Reputation: 705
I want to write a job in which each mapper checks whether a file from HDFS is stored on the node where it is executing. If it is not, I want to retrieve the file from HDFS and store it locally on that node. Is this possible?
EDIT: I am trying to implement "(3) Preprocessing for Repartition Join", as described here: link
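Concretely, inside each mapper I imagine something like the sketch below (hand-written and untested; LocalityCheck, the method names, and the paths are placeholders I made up, not an existing API):

    import java.io.IOException;
    import java.net.InetAddress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalityCheck {

        // True if every block of the given HDFS file has a replica on the
        // host this code is running on (comparing plain hostnames, which is
        // a simplification).
        static boolean isStoredLocally(Configuration conf, Path file)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            String localHost = InetAddress.getLocalHost().getHostName();
            for (BlockLocation block : blocks) {
                boolean blockIsLocal = false;
                for (String host : block.getHosts()) {
                    if (host.equals(localHost)) {
                        blockIsLocal = true;
                        break;
                    }
                }
                if (!blockIsLocal) {
                    return false;
                }
            }
            return true;
        }

        // Copy the file out of HDFS onto this node's local disk if it is
        // not already stored here.
        static void fetchIfRemote(Configuration conf, Path file, Path localDir)
                throws IOException {
            if (!isStoredLocally(conf, file)) {
                FileSystem.get(conf).copyToLocalFile(file, localDir);
            }
        }
    }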
Upvotes: 1
Views: 257
Reputation: 617
Why would you want to do this? Hadoop's data-locality principle already handles this for you, just in the opposite direction: it does not move the data to the program, it moves the program to the data.
This comes from the Wikipedia page about Hadoop:
The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. An example of this would be if node A contained data (x,y,z) and node B contained data (a,b,c): the jobtracker would schedule node B to perform map/reduce tasks on (a,b,c) and node A to perform map/reduce tasks on (x,y,z).
And the reason the computation is moved to the data and not the other way around is explained in the Hadoop documentation itself:
“Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
Upvotes: 0
Reputation: 33543
Hadoop's DistributedCache feature can be used to distribute the side data (auxiliary data) required for a job to complete. Here (1, 2) are some interesting articles on the topic.
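A rough sketch of the usual pattern, using the classic org.apache.hadoop.filecache.DistributedCache API (the file path, class, and variable names below are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideDataJob {

        public static class SideDataMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            @Override
            protected void setup(Context context) throws IOException {
                // The framework has already copied the cached files to this
                // task node's local disk, so these reads are local.
                Path[] cached = DistributedCache
                        .getLocalCacheFiles(context.getConfiguration());
                if (cached != null) {
                    for (Path p : cached) {
                        BufferedReader reader =
                                new BufferedReader(new FileReader(p.toString()));
                        try {
                            String line;
                            while ((line = reader.readLine()) != null) {
                                // load the side-data records into memory here
                            }
                        } finally {
                            reader.close();
                        }
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "side-data-example");
            job.setJarByClass(SideDataJob.class);
            job.setMapperClass(SideDataMapper.class);

            // Register an HDFS file; it will be shipped to every task node.
            DistributedCache.addCacheFile(
                    new URI("/user/hadoop/side-data.txt"), job.getConfiguration());

            // ... set input/output formats and paths, then job.waitForCompletion ...
        }
    }

The framework localizes each cached file onto a node's disk once per job, before the first task on that node starts, so every mapper on that node reads it locally instead of re-fetching it from HDFS.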
Upvotes: 1