Amazon Elastic Map Reduce: Job flow fails because output file is not yet generated

Question

I have an Amazon EMR job flow that performs three tasks, the output from the first being the input to the subsequent two. The second task's output is used by the third task through the DistributedCache.

I've created the job flow entirely on the EMR web site (console) but the cluster fails immediately because it cannot find the distributed cache file - because it has not yet been created by step #1.

Is my only option to create these steps from the CLI via a bootstrap action, and specify the --wait-for-steps option? It seems strange that I cannot execute a multi-step job flow where the input of one task relies on the output of another.

n4cer500 · Accepted Answer

In the end I got around this by creating an Amazon EMR cluster that bootstrapped but had no steps. Then I SSH'd into the head node and ran the Hadoop jobs from the shell.

I now have the flexibility to add them to a script with individual configuration options per job.
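As a rough illustration of that kind of script, here is a minimal Python sketch that runs the jobs one after another on the head node and stops at the first failure, so a later job never starts before the output it depends on exists. The jar names, class names, S3 paths and the `-D` property are hypothetical placeholders, not my actual setup, and passing `-D key=value` this way assumes the job's driver parses generic options (for example through `ToolRunner`).

```python
import subprocess


def build_job_command(jar, main_class, args, options=None):
    """Build a `hadoop jar` command line.

    `options` is a dict of per-job configuration values, emitted as
    `-D key=value` pairs before the job's own arguments.
    """
    cmd = ["hadoop", "jar", jar, main_class]
    for key, value in (options or {}).items():
        cmd += ["-D", f"{key}={value}"]
    return cmd + list(args)


def run_jobs(commands, runner=subprocess.run):
    """Run each command in order.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so the remaining jobs are skipped.
    """
    for cmd in commands:
        runner(cmd, check=True)


# Hypothetical three-job pipeline: every jar, class and path below is a
# placeholder to be replaced with real values.
JOBS = [
    build_job_command("step1.jar", "com.example.Step1",
                      ["s3://my-bucket/input", "s3://my-bucket/step1-out"]),
    build_job_command("step2.jar", "com.example.Step2",
                      ["s3://my-bucket/step1-out", "s3://my-bucket/step2-out"],
                      options={"mapreduce.job.reduces": 1}),
    build_job_command("step3.jar", "com.example.Step3",
                      ["s3://my-bucket/step1-out", "s3://my-bucket/step3-out"]),
]
```

On the head node, calling `run_jobs(JOBS)` would then launch the three jobs in sequence. How the third job picks up the second job's output as a cache file is left to that job's own driver code.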
