uri

Reputation: 182

Does Hadoop MapReduce re-process the entire dataset?

I wanted to know: does Hadoop MapReduce re-process the entire dataset if the same job is submitted twice? For example, the word count example counts the occurrences of each word in each file in an input folder. If I were to add a file to that folder and re-run the word count MapReduce job, would the initial files be re-read, re-mapped and re-reduced?

If so, is there a way to configure Hadoop to process ONLY the new files and add their results to a "summary" from previous MapReduce runs?

Any thoughts/help will be appreciated.

Upvotes: 2

Views: 477

Answers (3)

Arnon Rotem-Gal-Oz

Reputation: 25909

Hadoop itself does not support partial runs over the data, as others have mentioned. You can get the functionality you want if you use HBase as the source for your MapReduce job and pass in a Scan with an appropriate filter (e.g. restricted to timestamps later than the last run).
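A rough sketch of that idea, assuming the document text lives in an HBase table called "docs" under a "content:text" column and that the previous run's timestamp is persisted in a job property; the table name, column, property name and mapper class here are all made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class IncrementalWordCount {

    static class WordCountTableMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // "content:text" is a hypothetical column holding the document body
            byte[] body = value.getValue(Bytes.toBytes("content"), Bytes.toBytes("text"));
            if (body == null) {
                return;
            }
            for (String word : Bytes.toString(body).split("\\s+")) {
                context.write(new Text(word), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "incremental word count");
        job.setJarByClass(IncrementalWordCount.class);

        // only scan cells written since the previous run;
        // "wordcount.last.run" is a made-up property holding that timestamp
        long lastRun = conf.getLong("wordcount.last.run", 0L);
        Scan scan = new Scan();
        scan.setTimeRange(lastRun, System.currentTimeMillis());

        TableMapReduceUtil.initTableMapperJob("docs", scan, WordCountTableMapper.class,
                Text.class, IntWritable.class, job);

        // reducer and output configuration omitted -- same as an ordinary word count
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Only cells written after the last run fall inside the time range, so the mapper never sees documents that have already been counted.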

Upvotes: 0

Donald Miner

Reputation: 39893

I agree with everything Praveen says. I'll provide a specific way that I personally handle this on my cluster.


When I push files into HDFS, I put them into a folder named for the current minute, based on the system clock.

$ hadoop fs -put thisfile1249.txt /tmp/
$ hadoop fs -mv /tmp/thisfile1249.txt `date "+/data/%Y/%m/%d/%H/%M/"`

Let's see what the path is going to look like:

$ echo `date "+/data/%Y/%m/%d/%H/%M/"`
/data/2011/12/27/09/49/

This means that as files are coming in, they will go into the folder by the minute. Since time is monotonically increasing, when you run over a folder, you know that you won't have to go back and run over that folder again. If you want to run a job every hour, you can just point your input path to /data/2011/12/27/08. Daily would be /data/2011/12/26, etc.
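For an hourly run, the driver can compute that input path from the clock. A minimal sketch, assuming the standard word count mapper/reducer; the class name and the /counts output location are made up:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyWordCountDriver {
    public static void main(String[] args) throws Exception {
        // build the same /data/yyyy/MM/dd/HH layout the shell commands above create,
        // but for the previous hour so the folder is already complete
        Date previousHour = new Date(System.currentTimeMillis() - 60L * 60L * 1000L);
        String inputDir = new SimpleDateFormat("'/data/'yyyy/MM/dd/HH").format(previousHour);

        Job job = new Job(new Configuration(), "hourly word count");
        job.setJarByClass(HourlyWordCountDriver.class);
        FileInputFormat.addInputPath(job, new Path(inputDir));
        FileOutputFormat.setOutputPath(job, new Path("/counts" + inputDir.substring("/data".length())));

        // mapper/reducer setup is the ordinary word count and is omitted here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each run then covers exactly one folder that will never receive new files, so nothing is ever read twice.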

Upvotes: 0

Praveen Sripati

Reputation: 33495

If I were to add a file to that folder and re-run the word count MapReduce job, would the initial files be re-read, re-mapped and re-reduced?

Hadoop will reprocess the entire dataset when the job is run again. The mappers' output and other temporary data are deleted once the job completes successfully.

If so, is there a way to configure Hadoop to process ONLY the new files and add their results to a "summary" from previous MapReduce runs?

Hadoop as-is doesn't support such a scenario, but you could write a custom InputFormat which picks up only the unprocessed or new files, and a custom OutputFormat which adds the new data to the summary from the previous run. Alternatively, once the job has run, put the files that still need processing into a different input folder and let the job process only the files in that new folder.

Check this article on creating custom input/output formats.
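One lightweight way to get the "only new files" part without writing a full InputFormat is a PathFilter registered through FileInputFormat.setInputPathFilter. A rough sketch, assuming the driver records already-processed file names in a job property; the property name wordcount.processed.files and the class name are made up for illustration (merging into the previous summary would still have to be handled on the output side):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accepts only paths that were not seen in a previous run. FileInputFormat
// instantiates the filter via ReflectionUtils, which passes in the job
// Configuration because the class implements Configurable.
public class UnprocessedFilesFilter implements PathFilter, Configurable {

    private Configuration conf;
    private final Set<String> processed = new HashSet<String>();

    public void setConf(Configuration conf) {
        this.conf = conf;
        if (conf != null) {
            // "wordcount.processed.files" is a made-up property listing file names
            processed.addAll(Arrays.asList(
                    conf.getStrings("wordcount.processed.files", new String[0])));
        }
    }

    public Configuration getConf() {
        return conf;
    }

    public boolean accept(Path path) {
        return !processed.contains(path.getName());
    }
}

In the driver you would then set the property and register the filter, e.g. (namesFromLastRun being however you persisted the file names from the previous run):

job.getConfiguration().setStrings("wordcount.processed.files", namesFromLastRun);
FileInputFormat.setInputPathFilter(job, UnprocessedFilesFilter.class);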

I am not sure of the exact requirements, but you can also consider frameworks which process streams of data, like HStreaming, S4, Twitter Storm and others.

Upvotes: 3
