Reputation: 8144
I'm working on Hadoop. I have 100k zip files and I'm processing them using MapReduce. Now I have a task where I need to keep track of some logs:
1. Zip files processed
2. Zip files that still need to be processed
3. Status of the process, like error or success
I'm doing it using the following approach:
catch (Exception ex)
{
    System.out.println("Killing task");
    // Fail this attempt (shouldFail = true) so the framework can retry it
    runningJob.killTask((TaskAttemptID) context.getTaskAttemptID(), true);
}
This works, but now I need to store this information in a common place.
How can I do it?
I thought of storing it in HBase. Ideas are welcome. Kindly help me.
Upvotes: 5
Views: 164
Reputation: 8449
Counters are indeed the best solution; however, don't overuse them, as they also carry significant overhead.
You can consider aggregating the counts inside the task and flushing them only from time to time, as in the sketch below.
Note that if you use a manual mechanism to track these statistics, you have to account for tasks that run more than once (because of various errors, or because of speculative execution).
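A minimal sketch of that aggregation pattern, assuming the new org.apache.hadoop.mapreduce API; the ZipCounters enum, field names, and flush interval are made up for this example:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ZipProcessingMapper extends Mapper<Text, Text, Text, Text> {

    // Hypothetical counters for this job
    public enum ZipCounters { PROCESSED, FAILED }

    private static final long FLUSH_EVERY = 1000;
    private long processed = 0;
    private long failed = 0;

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // ... process one zip file ...
            processed++;
        } catch (Exception e) {
            failed++;
        }
        // Flush the locally aggregated counts only occasionally
        if ((processed + failed) % FLUSH_EVERY == 0) {
            flush(context);
        }
    }

    private void flush(Context context) {
        context.getCounter(ZipCounters.PROCESSED).increment(processed);
        context.getCounter(ZipCounters.FAILED).increment(failed);
        processed = 0;
        failed = 0;
    }

    @Override
    protected void cleanup(Context context) {
        flush(context); // report whatever is left when the task ends
    }
}

The cleanup() override matters: without it, anything accumulated since the last flush would be lost when the task finishes.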
Upvotes: 1
Reputation: 9481
Here are some ideas for you:
Use custom task counters (see http://lintool.github.io/Cloud9/docs/content/counters.html). They are very lightweight and a great way to keep track of small values.
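In its simplest form, assuming the new mapreduce API (the ZipStatus enum is made up for this example):

// Hypothetical counter group for the zip-processing job
public enum ZipStatus { SUCCESS, ERROR }

// inside map(), after one zip file is handled successfully:
context.getCounter(ZipStatus.SUCCESS).increment(1);

// and in the catch block:
context.getCounter(ZipStatus.ERROR).increment(1);

The framework aggregates these across all tasks, and the totals are visible in the job UI or via job.getCounters() in the driver.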
If you need to record more details, there are two ways of doing this. The first: output the log statements as part of your map job, then split your pipeline using two simple filters (map-only jobs). The first filter takes the zip-processing output and plugs into the rest of your pipeline; the second filter takes the log statements and saves them to a separate location for further analysis.
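As an illustration of the second filter, a minimal map-only job that keeps just the log records, assuming the zip-processing job tags each log statement with a "LOG\t" prefix (the tag and record format are assumptions for this sketch):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final String LOG_TAG = "LOG\t";

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        if (record.startsWith(LOG_TAG)) {
            // keep only the log statements, stripped of their tag
            context.write(new Text(record.substring(LOG_TAG.length())),
                          NullWritable.get());
        }
        // everything else is dropped; the mirror-image filter does the opposite
    }
}

The first filter is the mirror image: it drops the tagged log records and passes everything else on to the rest of the pipeline. Running both as map-only jobs (zero reducers) keeps them cheap.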
Using HBase would work too, but it will bring extra complexity and use a lot more resources on your cluster, unless you already have HBase as part of your pipeline.
Upvotes: 1