IttayD

Reputation: 29123

Hadoop: how do I know what file a task was handling when it failed?

I have a job with some failed tasks. I want to try to reproduce the failure on the files those tasks were processing, but I can't figure out how to determine which files they were.

How can I find what files a task was handling when it failed?

Upvotes: 1

Views: 39

Answers (2)

IttayD

Reputation: 29123

It turns out that grepping the task logs shows which files the task was reading.
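For example, map tasks started via a FileInputFormat typically log a "Processing split:" line naming the file and byte range when they begin, so on a YARN cluster something along these lines may surface the paths (the application id below is a placeholder for your job's id):

yarn logs -applicationId application_1400000000000_0001 | grep "Processing split:"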

Upvotes: 0

Thomas Jungblut

Reputation: 20969

I have no idea if this really works, but you may want to try it out (I was coding against Hadoop 2.2):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCompletionEvent;
import org.apache.hadoop.mapreduce.TaskCompletionEvent.Status;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.ReflectionUtils;

job.waitForCompletion(true);
// Recompute the splits the job used, in the order the framework assigns them.
Configuration conf = job.getConfiguration();
Class<? extends InputFormat<?, ?>> clz = job.getInputFormatClass();
InputFormat<?, ?> inputFormat = ReflectionUtils.newInstance(clz, conf);
List<InputSplit> splits = inputFormat.getSplits(job);
// Walk the task completion events and keep only failed map tasks.
TaskCompletionEvent[] events = job.getTaskCompletionEvents(0);
for (TaskCompletionEvent ev : events) {
  if (ev.isMapTask() && ev.getStatus() == Status.FAILED) {
    // A map task's index within the job matches the index of its split.
    int idWithinJob = ev.idWithinJob();
    InputSplit inputSplit = splits.get(idWithinJob);
    if (inputSplit instanceof FileSplit) {
      FileSplit sp = (FileSplit) inputSplit;
      System.out.println(sp.getPath() + " failed!");
    }
  }
}

The idea is rather simple: you get all task completion events and keep the failed map ones. From each event you can obtain the index that is usually assigned to the split internally.

The split itself can be obtained by recomputing the splits over the job data. Please note that a FileSplit can also be just a part of a file (a block), so you will want to check its internal offset and length fields. Also, the type of the split depends on the InputFormat, so there is no guarantee that the returned splits are FileSplits.
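For instance, here is a minimal sketch building on the loop above, using FileSplit's getStart() and getLength() to show which byte range of the file the failed task was reading:

if (inputSplit instanceof FileSplit) {
  FileSplit sp = (FileSplit) inputSplit;
  // The split covers bytes [getStart(), getStart() + getLength()) of the file;
  // a large file is usually divided into several such splits.
  System.out.println(sp.getPath() + " failed between bytes "
      + sp.getStart() + " and " + (sp.getStart() + sp.getLength()));
}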

Upvotes: 1
