Reputation: 4629
I have a MapReduce job whose input needs to be in a certain state before it runs. I tried using a combiner to achieve this, but the job hangs at "Starting flush of map output". Because of that, I have decided to move to a chained job flow, combining two jobs to do what's needed.
I have looked at examples on the web (e.g. Yahoo's docs), but I am unable to find a way to do this using the newer API, and would appreciate an example of how to do so. Below is my flow:
Job1:
Job2:
If someone could give me an example of doing the above, or can point me to some docs, I'd appreciate it.
The ChainMapper approach isn't appropriate here either, as I need an Iterable of values grouped by key when I do my reducing.
Upvotes: 0
Views: 909
Reputation: 4629
If you specify a path as the output of the first job, then use that same path as the input of the second job, the first job's output acts as intermediate data on HDFS; you can delete that path once the second job has completed. This is the easiest way to link two jobs where one consumes the other's output.
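A minimal driver sketch of this pattern with the new (`org.apache.hadoop.mapreduce`) API. The mapper/reducer classes, key/value types, and the intermediate path here are placeholders, not part of the original question; `Job.getInstance` assumes Hadoop 2.x (on older releases use `new Job(conf, "name")` instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical intermediate path shared by the two jobs.
        Path intermediate = new Path("/tmp/job1-output");

        Job job1 = Job.getInstance(conf, "job1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(FirstMapper.class);      // placeholder class
        job1.setReducerClass(FirstReducer.class);    // placeholder class
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, intermediate);

        // Block until job1 finishes; bail out if it fails.
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        Job job2 = Job.getInstance(conf, "job2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(SecondMapper.class);     // placeholder class
        job2.setReducerClass(SecondReducer.class);   // placeholder class
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job2, intermediate); // job1's output feeds job2
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));

        boolean ok = job2.waitForCompletion(true);
        // Clean up the intermediate data once the second job is done.
        intermediate.getFileSystem(conf).delete(intermediate, true);
        System.exit(ok ? 0 : 1);
    }
}
```

Note that Hadoop does not delete the intermediate directory for you; the explicit `delete` call at the end is what clears it out.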
Upvotes: 0
Reputation: 16392
In the Java code that configures and submits your jobs (usually the run() method of a class implementing Tool, driven by ToolRunner), submit your first job with:
job.waitForCompletion(true);
before configuring and submitting the second job. This call blocks until the job completes (printing progress if you pass true) and returns whether it succeeded, so the second job only runs once the first has finished.
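A sketch of sequencing two jobs inside a Tool's run() method, as described above. The class name and the configureJob1/configureJob2 helpers are hypothetical stand-ins for whatever job setup you already have:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job1 = configureJob1(getConf(), args);   // hypothetical helper
        // waitForCompletion(true) blocks until job1 finishes and
        // returns true only on success.
        if (!job1.waitForCompletion(true)) {
            return 1;
        }
        Job job2 = configureJob2(getConf(), args);   // hypothetical helper
        return job2.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ChainDriver(), args));
    }
}
```

Returning a non-zero code when the first job fails keeps the second job from running against missing or partial output.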
Upvotes: 0