ddinchev

Reputation: 34673

Merge HDFS files without going through the network

I could do this:

hadoop fs -text /path/to/result/of/many/reducers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv

But this streams the file contents over the network through the client. Is there a way to tell HDFS to merge a few files on the cluster itself?

Upvotes: 0

Views: 1058

Answers (1)

Roman Nikitchenko

Reputation: 13046

I have a problem similar to yours. Here is a rundown of HDFS file merging options, but each of them has its own specifics, and none of them meets my requirements. Hope this helps you.

  • HDFS concat (actually FileSystem.concat()). A fairly recent API. It requires the original file to have its last block full (see the sketch after this list).
  • MapReduce jobs: I will probably end up with a solution based on this technology, but it is slow to set up.
  • copyMerge - as far as I can see, this copies the data again, but I have not checked the details yet.
  • File crush - again, this looks like MapReduce.
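For the first option, here is a minimal sketch of what a FileSystem.concat() call might look like, assuming HDFS is the default file system; the part file names below are hypothetical and the restrictions mentioned above still apply:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConcatSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // DistributedFileSystem if fs.defaultFS is hdfs://

        // The target must already exist; a common trick is to rename the first
        // part file to the target name and concat the remaining parts onto it.
        Path target = new Path("/path/to/concatenated/file/target.csv");
        Path[] sources = new Path[] {
            new Path("/path/to/result/of/many/reducers/part-00001"), // hypothetical names
            new Path("/path/to/result/of/many/reducers/part-00002")
        };

        // concat() is a NameNode metadata operation: the source blocks are
        // re-linked to the target file, so no block data travels over the network.
        // Non-HDFS FileSystem implementations throw UnsupportedOperationException.
        fs.concat(target, sources);

        fs.close();
    }
}

By contrast, copyMerge reads every byte through the client and writes it back out, so it does not avoid the network traffic the question is trying to eliminate.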

So the main conclusion is: if MapReduce setup speed suits you, there is no problem. If you have real-time requirements, things get complicated.

One of my 'crazy' ideas is to use HBase coprocessor mechanics (endpoints) and file block locality information for this, as I have HBase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction

Upvotes: 1
