rretzbach

Reputation: 744

Analyze total error entry occurrences in a time frame from log files with a Hadoop MapReduce job

I have a huge number of logfiles stored in HDFS which look like the following:

2012-10-20 00:05:00; BEGIN
...
SQL ERROR -678: Error message
...
2012-10-20 00:47:20; END

I'd like to know how often certain SQL error codes occurred during a given time frame, e.g.: how many 678 SQL ERRORs occurred from 20 OCT 2012 0:00am until 20 OCT 2012 1:00am?

Since the files are typically split into several blocks, they could be distributed across all data nodes.

Is such a query possible? I'd like to use the Hadoop MapReduce Java API or Apache Pig, but I don't know how to apply the time frame condition.

Upvotes: 3

Views: 158

Answers (1)

Praveen Sripati

Reputation: 33545

HDFS doesn't take newlines into consideration when splitting a file into blocks, so a single line might be split across two blocks. MapReduce, however, does: a line in the input file will always be processed by a single mapper.

2012-10-20 00:05:00; BEGIN
...
SQL ERROR -678: Error message
...
2012-10-20 00:47:20; END

If the file is bigger than the block size, then there is a good chance that the above lines will end up in two different blocks and be processed by different mappers. FileInputFormat.isSplitable() can be overridden to make sure that a single log file is processed by a single mapper and not split across multiple mappers, as in the sketch below.
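A minimal sketch using the new MapReduce API (the class name is just illustrative): the input format returns false from isSplitable(), so each log file becomes a single split.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Keeps each log file in a single split, so one mapper sees
// the whole file from BEGIN to END.
public class WholeLogFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size
    }
}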

Hadoop will invoke the user-defined map function with key/value pairs, where the key is the file offset and the value is a line from the input file. An instance variable is required to store the BEGIN time, so it can be checked against the time frame in later calls to the map function.
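A minimal sketch of such a mapper, assuming the non-splittable input format above and that the query window is passed through the job configuration as window.from/window.to (those property names are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SqlErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private String begin; // BEGIN timestamp, remembered across map() calls
    private String from;  // window start, e.g. "2012-10-20 00:00:00"
    private String to;    // window end,   e.g. "2012-10-20 01:00:00"

    @Override
    protected void setup(Context context) {
        from = context.getConfiguration().get("window.from");
        to = context.getConfiguration().get("window.to");
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.endsWith("BEGIN")) {
            begin = line.substring(0, 19); // e.g. "2012-10-20 00:05:00"
        } else if (line.startsWith("SQL ERROR") && begin != null
                && begin.compareTo(from) >= 0 && begin.compareTo(to) < 0) {
            // "yyyy-MM-dd HH:mm:ss" timestamps compare correctly as plain strings
            String code = line.substring(0, line.indexOf(':')).replace("SQL ERROR ", "").trim();
            context.write(new Text(code), ONE); // e.g. ("-678", 1)
        }
    }
}

A standard summing reducer (as in the classic word-count example) then totals the counts per error code. Note one assumption in this sketch: each error is attributed to the time frame based on its run's BEGIN timestamp; different semantics would also need the END line.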

This is not an efficient way, since a single mapper processes a whole log file and the work is not distributed.

Another approach is to pre-process the log files by combining the relevant lines into a single line. This way, the relevant lines of a log file will be processed by a single mapper only.
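For example, a pre-processing step could rewrite each run as one self-contained line (the exact layout here is just an assumption):

2012-10-20 00:05:00;2012-10-20 00:47:20;-678

Since each line now carries its own BEGIN and END timestamps, the default splittable TextInputFormat can be used and the time-frame check becomes a stateless per-line test, so the job parallelizes across blocks.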

FYI, a more complex approach that does not use FileInputFormat.isSplitable() is also possible, but it would need to be worked out.

The pros and cons of each approach have to be evaluated and the right one picked.

Upvotes: 1
