Reputation: 34673
I could do this:
hadoop fs -text /path/to/result/of/many/reducers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv
But this streams all the data through the client machine over the network. Is there a way to tell HDFS to merge the files on the cluster itself?
Upvotes: 0
Views: 1058
Reputation: 13046
I have a problem similar to yours. Here is an article with a number of HDFS file-merging options, but each of them has its own specifics, and none of them meets my requirements. I hope it helps you anyway.
The main takeaway: if MapReduce job setup time suits you, there is no problem. If you have real-time requirements, things get complex.
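For illustration, here is a minimal sketch of that MapReduce option, assuming Hadoop Streaming is available; the jar location, property name, and output path below are assumptions that vary by Hadoop version and setup, not something from the question:

# Identity job: cat as both mapper and reducer, with a single reducer,
# so all part files are merged into one file on the cluster itself.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=1 \
    -input /path/to/result/of/many/reducers \
    -output /path/to/merged \
    -mapper cat \
    -reducer cat
# The merged result lands as a single part-00000 file inside /path/to/merged
# (the output directory must not exist beforehand); rename it afterwards:
hadoop fs -mv /path/to/merged/part-00000 /path/to/concatenated/file/target.csv

Keep in mind the shuffle sorts lines by everything up to the first tab, so the original row order across the part files is not preserved.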
One of my 'crazy' ideas is to use the HBase coprocessor mechanism (endpoints) together with the files' block-locality information, since I have HBase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction
Upvotes: 1