Reputation: 3096
Let's imagine I want to store a large number of URLs with associated metadata
URL => Metadata
in a file
hdfs://db/urls.seq
I would like this file to grow (as new URLs are found) after every MapReduce run.
Would that work with Hadoop? As I understand it, MapReduce outputs data to a new directory. Is there any way to take that output and append it to the existing file?
The only idea that comes to mind is to create a temporary urls.seq and then replace the old one. It works, but it feels wasteful. Also, from my understanding, Hadoop favors the "write once" approach, and this idea seems to conflict with that.
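For illustration, here is a minimal sketch of that replace idea using the SequenceFile API, assuming Text keys and values; the job output directory and temporary path below are hypothetical. It copies the old urls.seq plus the new job output into a temporary file and then swaps it into place.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class MergeUrls {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path master = new Path("hdfs://db/urls.seq");       // the existing file
            Path jobOutput = new Path("hdfs://db/job-output");  // hypothetical MapReduce output dir
            Path tmp = new Path("hdfs://db/urls.seq.tmp");      // hypothetical temporary merged file

            Text key = new Text();
            Text value = new Text();

            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, tmp, Text.class, Text.class);
            try {
                // Copy the existing records first, then every part-* file from the last job.
                copy(fs, conf, master, writer, key, value);
                for (FileStatus status : fs.listStatus(jobOutput)) {
                    if (status.getPath().getName().startsWith("part-")) {
                        copy(fs, conf, status.getPath(), writer, key, value);
                    }
                }
            } finally {
                writer.close();
            }

            // Swap the merged file in place of the old one.
            fs.delete(master, false);
            fs.rename(tmp, master);
        }

        private static void copy(FileSystem fs, Configuration conf, Path src,
                                 SequenceFile.Writer writer, Text key, Text value)
                throws Exception {
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, src, conf);
            try {
                while (reader.next(key, value)) {
                    writer.append(key, value);
                }
            } finally {
                reader.close();
            }
        }
    }

The rename itself is a cheap metadata operation in HDFS; the cost is rewriting the whole file on every run, which is exactly the waste described above.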
Upvotes: 1
Views: 69
Reputation: 81
As blackSmith has explained, you can append to an existing file in HDFS, but it will hurt your performance because HDFS is designed around a "write once" strategy. My suggestion is to avoid this approach unless there is no other option. One approach you may consider is to create a new file for every MapReduce output. If each output is large enough, this technique will benefit you most, because writing a new file does not degrade performance the way appending does. And if you are reading the output of each MapReduce job in the next job, reading a new file won't hurt your performance as much as appending would. So there is a trade-off; it depends on whether you want performance or simplicity. (Anyway, Merry Christmas!)
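To make that concrete, here is a rough sketch of the one-directory-per-run layout this answer describes, under the assumption that each run writes to its own timestamped directory (the paths and job setup are hypothetical): the next job simply adds every earlier run directory as an input path instead of appending to a single file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UrlsPerRunJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Job job = Job.getInstance(conf, "process-urls");
            // mapper, reducer, and key/value classes would be set here as usual

            // Every earlier run lives in its own directory, e.g. hdfs://db/urls/run-1387000000000
            Path base = new Path("hdfs://db/urls");
            for (FileStatus run : fs.listStatus(base)) {
                if (run.isDirectory()) {
                    FileInputFormat.addInputPath(job, run.getPath());
                }
            }

            // This run writes to a fresh directory instead of appending to urls.seq.
            Path newRun = new Path(base, "run-" + System.currentTimeMillis());
            FileOutputFormat.setOutputPath(job, newRun);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Job.getInstance, FileInputFormat.addInputPath, and FileOutputFormat.setOutputPath are standard MapReduce APIs; the per-run directory naming is just one possible convention.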
Upvotes: 1