Reputation: 4629
I have a MapReduce job whose input needs to be in a certain state before it runs. I tried using a combiner to achieve this, but the job hangs at "Starting flush of map output". Because of that, I have decided to move to a chained job flow, combining two jobs to do what's needed.
I have looked at examples on the web (e.g. Yahoo's docs), but I am unable to find a way to do this using the newer API, and would appreciate an example of how to do so. Below is my flow:
Job1:
Job2:
If someone could give me an example of doing the above, or can point me to some docs, I'd appreciate it.
The ChainMapper approach isn't appropriate here either, as I need an Iterable of values grouped by key when I do my reducing.
Upvotes: 0
Views: 909
Reputation: 4629
If you specify a path as the output of the first job, then use that same path as the input of the second job, the first job's output acts as intermediate data on HDFS; you can delete that path once the second job has completed. This is the easiest way to link two jobs where one consumes the other's output.
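A minimal driver sketch of this pattern with the new (`org.apache.hadoop.mapreduce`) API. The mapper/reducer classes, key/value types, and the intermediate path here are placeholders, not part of the original question; `Job.getInstance` assumes Hadoop 2.x (on older releases use `new Job(conf, "name")` instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical intermediate path shared by the two jobs.
        Path intermediate = new Path("/tmp/job1-output");

        Job job1 = Job.getInstance(conf, "job1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(FirstMapper.class);      // placeholder class
        job1.setReducerClass(FirstReducer.class);    // placeholder class
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, intermediate);

        // Block until job1 finishes; bail out if it fails.
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        Job job2 = Job.getInstance(conf, "job2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(SecondMapper.class);     // placeholder class
        job2.setReducerClass(SecondReducer.class);   // placeholder class
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job2, intermediate); // job1's output feeds job2
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));

        boolean ok = job2.waitForCompletion(true);
        // Clean up the intermediate data once the second job is done.
        intermediate.getFileSystem(conf).delete(intermediate, true);
        System.exit(ok ? 0 : 1);
    }
}
```

Note that Hadoop does not delete the intermediate directory for you; the explicit `delete` call at the end is what clears it out.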
Upvotes: 0
Reputation: 16392
In the Java code that configures and submits your jobs (usually the run() method of a class implementing Tool, driven by ToolRunner), submit your first job with:
job.waitForCompletion(true);
before configuring and submitting the second job. This call blocks until the job completes (printing progress if you pass true) and returns whether it succeeded, so the second job only runs once the first has finished.
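A sketch of sequencing two jobs inside a Tool's run() method, as described above. The class name and the configureJob1/configureJob2 helpers are hypothetical stand-ins for whatever job setup you already have:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ChainDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job1 = configureJob1(getConf(), args);   // hypothetical helper
        // waitForCompletion(true) blocks until job1 finishes and
        // returns true only on success.
        if (!job1.waitForCompletion(true)) {
            return 1;
        }
        Job job2 = configureJob2(getConf(), args);   // hypothetical helper
        return job2.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ChainDriver(), args));
    }
}
```

Returning a non-zero code when the first job fails keeps the second job from running against missing or partial output.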
Upvotes: 0