Reputation: 51
I have two HDFS clusters set up, C1 and C2. Both store a large amount of data but, for this particular job, one of them holds the majority of the necessary data (let's say C1 has 90%) and the rest of the data is on C2. I want to write an M/R job that runs on C1 but can still access the other 10% of the data on C2. Does Hadoop have this kind of feature built in? Has anyone encountered this situation before?
I have a few ideas that I know will work:
1) I can explicitly distcp the necessary data over and just run on C1 (see the sketch after this list), but I am hoping for a cleaner and more flexible solution.
2) I've seen a bit about HDFSProxy, and it seems like it might solve this problem. Any idea how much of a performance hit I can expect?
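For reference, a minimal sketch of option 1. The hostnames, port, and paths here are made up for illustration; substitute your real namenode addresses:

```sh
# Hypothetical hosts/ports/paths: pull the ~10% of input that lives on C2
# into a staging directory on C1, then run the job entirely on C1.
hadoop distcp hdfs://namenode2:12345/jobdata hdfs://namenode1:12345/staging/jobdata
```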
Either way I expect to pay the price of pulling the desired data from C2 to C1 so that the compute nodes in C1 can process it.
I'm pretty new to Hadoop so any pointers would be greatly appreciated. Thanks!
Upvotes: 2
Views: 216
Reputation: 51
I'll go ahead and answer my own question in case anyone else is curious in the future.
Turns out that Hadoop is nice enough to have a solution to this problem built in. If the inputs are listed as coming from multiple namenodes (e.g., hdfs://namenode1:12345/file1 and hdfs://namenode2:12345/file2), then Hadoop will automatically pull the files from the second cluster to the first and execute. The cluster from which the job is submitted dictates which cluster it will execute on.
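Here's a minimal driver sketch of what this looks like. The hostnames, port, and paths are placeholders (matching the example URIs above), and I'm using the identity Mapper/Reducer just to keep it self-contained:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrossClusterDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cross-cluster job");
        job.setJarByClass(CrossClusterDriver.class);

        // Identity mapper/reducer so the sketch compiles as-is;
        // substitute your real classes here.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Inputs from two different namenodes via fully qualified URIs.
        // Splits that live on the other cluster are fetched over the
        // network by the tasks running on the submitting cluster.
        FileInputFormat.addInputPath(job, new Path("hdfs://namenode1:12345/file1"));
        FileInputFormat.addInputPath(job, new Path("hdfs://namenode2:12345/file2"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://namenode1:12345/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```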
Obviously this isn't ideal, since a small proportion of the job is guaranteed to be bringing the data to the computation rather than taking the computation to the data, but it will work.
Upvotes: 3