backtrack

Reputation: 8144

Logging Hadoop map process

I'm working with Hadoop. I have 100k zip files and I process them using MapReduce. Now I have a task where I need to keep track of some logs:

1. Zip files processed
2. Zip files still to be processed
3. Status of the process, like error or success

I'm currently doing it like this:

    catch (Exception ex) {
        // On error, fail this task attempt (the second argument marks it
        // as failed rather than merely killed)
        System.out.println("Killing task");
        runningJob.killTask((TaskAttemptID) context.getTaskAttemptID(), true);
    }

But now I need to store these logs in a common place. How can I do that?

I thought of storing them in HBase. Ideas are welcome; kindly help me.

Upvotes: 5

Views: 164

Answers (2)

Ophir Yoktan

Reputation: 8449

Counters are indeed the best solution; however, don't overuse them, as they also carry significant overhead.

You can consider aggregating the counter values inside the task and flushing them to the framework only from time to time, as in the sketch below.
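A minimal sketch of this pattern in a new-API mapper (the counter group and names here are placeholders, not anything from the original question):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ZipStatsMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Aggregate locally instead of touching the framework counters per record.
        private long processed = 0;
        private long failed = 0;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // ... process one zip file ...
                processed++;
            } catch (Exception e) {
                failed++;
            }
            // Flush the local tallies only occasionally.
            if ((processed + failed) % 10000 == 0) {
                flushCounters(context);
            }
        }

        @Override
        protected void cleanup(Context context) {
            flushCounters(context); // final flush so nothing is lost
        }

        private void flushCounters(Context context) {
            context.getCounter("zips", "processed").increment(processed);
            context.getCounter("zips", "failed").increment(failed);
            processed = 0;
            failed = 0;
        }
    }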

Note that if you use a manual mechanism to track these statistics, then you have to account for tasks that are run more than once (because of various errors, or because of speculative execution).

Upvotes: 1

Vlad

Reputation: 9481

Here are some ideas for you:

  1. Use custom task counters (http://lintool.github.io/Cloud9/docs/content/counters.html). They are very lightweight and a great way to keep track of small values; see the counter sketch after this list.

  2. If you need to record more details, you can output log statements as part of your map job and then split your pipeline using two simple filters (map jobs): the first filter takes the output of your zip processing and plugs into the rest of your pipeline, while the second takes the log statements and saves them to a separate location for further analysis. A sketch of one way to do this follows the list.

    Using HBase would work too, but it will bring extra complexity and use a lot more resources on your cluster, unless you already have HBase as part of your pipeline.
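For item 1, a minimal enum-based counter sketch in the new MapReduce API (class and counter names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ZipMapper extends Mapper<LongWritable, Text, Text, Text> {

        // These counters show up in the job UI and the driver's output.
        public enum ZipCounters { PROCESSED, ERRORS }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // ... unzip and process one file ...
                context.getCounter(ZipCounters.PROCESSED).increment(1);
            } catch (Exception e) {
                context.getCounter(ZipCounters.ERRORS).increment(1);
            }
        }
    }

After the job finishes, the driver can read the totals back with `job.getCounters().findCounter(ZipCounters.ERRORS).getValue()`.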
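For item 2, one built-in way to route log records to a separate location without a second filter job is MultipleOutputs. This is just a sketch of one option, not the only way to split the pipeline; the named output "logs" and the class name are arbitrary:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class ZipLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // ... process the zip and emit normal output ...
                context.write(value, new Text("SUCCESS"));
            } catch (Exception e) {
                // Log records go to the "logs" named output, not the main one.
                mos.write("logs", value, new Text("ERROR: " + e.getMessage()));
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();
        }
    }

The driver has to register the named output before submitting, e.g. `MultipleOutputs.addNamedOutput(job, "logs", TextOutputFormat.class, Text.class, Text.class);`.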

Upvotes: 1
