Reputation: 51
I have two HDFS clusters set up, C1 and C2. Both store a large amount of data but, for this particular job, one of them holds the majority of the necessary data (let's say C1 has 90%) and the rest of the data is on C2. I want to write an M/R job that runs on C1 but can still access the other 10% of the data on C2. Does Hadoop have this kind of feature built in? Has anyone encountered this situation before?
I have a few ideas that I know will work:
1) I can explicitly distcp the necessary data over and just run on C1 (see the sketch after this list), but I am hoping for a cleaner and more flexible solution.
2) I've seen a bit about HDFSProxy, and it seems like it might solve this problem. Any idea how much of a performance hit I can expect?
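For reference, a minimal sketch of option 1. The hostnames, port, and paths here are made up for illustration; substitute your real namenode addresses:

```sh
# Hypothetical hosts/ports/paths: pull the ~10% of input that lives on C2
# into a staging directory on C1, then run the job entirely on C1.
hadoop distcp hdfs://namenode2:12345/jobdata hdfs://namenode1:12345/staging/jobdata
```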
Either way I expect to pay the price of pulling the desired data from C2 to C1 so that the compute nodes in C1 can process it.
I'm pretty new to Hadoop so any pointers would be greatly appreciated. Thanks!
Upvotes: 2
Views: 216
Reputation: 51
I'll go ahead and answer my own question in case anyone else is curious in the future.
Turns out that Hadoop is nice enough to have a solution to this problem built in. If the inputs are listed as coming from multiple namenodes (e.g., hdfs://namenode1:12345/file1 and hdfs://namenode2:12345/file2), then Hadoop will automatically pull the files from the second cluster to the first and execute. The cluster from which the job is submitted dictates which cluster it will execute on.
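Here's a minimal driver sketch of what this looks like. The hostnames, port, and paths are placeholders (matching the example URIs above), and I'm using the identity Mapper/Reducer just to keep it self-contained:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrossClusterDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cross-cluster job");
        job.setJarByClass(CrossClusterDriver.class);

        // Identity mapper/reducer so the sketch compiles as-is;
        // substitute your real classes here.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Inputs from two different namenodes via fully qualified URIs.
        // Splits that live on the other cluster are fetched over the
        // network by the tasks running on the submitting cluster.
        FileInputFormat.addInputPath(job, new Path("hdfs://namenode1:12345/file1"));
        FileInputFormat.addInputPath(job, new Path("hdfs://namenode2:12345/file2"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://namenode1:12345/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```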
Obviously this isn't ideal, since a small proportion of the job is guaranteed to be bringing the data to the computation rather than taking the computation to the data, but it will work.
Upvotes: 3