Reputation: 3225
We need to parse several log files and run some statistics on the log entries found (things such as the number of occurrences of certain messages, spikes of occurrences, etc.). The problem is writing a log parser that will handle several log formats and allow me to add a new log format with very little work.
To make things easier for now I'm only looking at logs that will basically look similar to this:
[11/17/11 14:07:14:030 EST] MyXmlParser E Premature end of file
so each log entry will contain a timestamp, originator (of the log message), level and log message. One important detail is that a message may have more than one line (e.g. a stacktrace).
Another instance of the log entry could be:
17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file
I'm looking for a good way to specify the log format as well as the most adequate technology to implement the parser for it. I thought about regular expressions but I think it will be tricky to handle situations such as multi-line messages (e.g. stacktraces).
Actually the task of writing a parser for a specific log format does not sound so easy itself when I consider the possibility of multi-line messages. How do you go about parsing those files?
Ideally I would be able to specify something like this as a log format:
[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE
or
%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE
Obviously I would have to assign the right converter to each field so it would be handled correctly (e.g. the timestamp).
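To make the idea concrete, here is a minimal sketch of how such a format spec could be compiled into a regex with named groups. The class name, the per-field patterns, and the token syntax are illustrative assumptions, not an existing API; MESSAGE is greedy to the end of the line, TIMESTAMP matches reluctantly so it can contain spaces, and other fields are single tokens:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch (hypothetical): compile a "%FIELD" format spec such as
// "[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE" into a regex with named groups.
public class LogFormatCompiler {

    public static Pattern compile(String format) {
        Matcher token = Pattern.compile("%([A-Z]+)").matcher(format);
        StringBuilder regex = new StringBuilder();
        int last = 0;
        while (token.find()) {
            // Quote the literal text (brackets, " - ", spaces) between tokens.
            regex.append(Pattern.quote(format.substring(last, token.start())));
            String name = token.group(1);
            String body;
            switch (name) {
                case "MESSAGE":   body = ".*";   break; // greedy, to end of line
                case "TIMESTAMP": body = ".+?";  break; // may contain spaces; reluctant
                default:          body = "\\S+"; break; // single-token fields
            }
            regex.append("(?<").append(name).append(">").append(body).append(")");
            last = token.end();
        }
        regex.append(Pattern.quote(format.substring(last)));
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = compile("[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE");
        Matcher m = p.matcher("[11/17/11 14:07:14:030 EST] MyXmlParser E Premature end of file");
        if (m.matches()) {
            System.out.println(m.group("TIMESTAMP")); // 11/17/11 14:07:14:030 EST
            System.out.println(m.group("LEVEL"));     // E
        }
    }
}
```

Field-specific converters (e.g. a SimpleDateFormat for TIMESTAMP) would then be applied to the named groups after matching.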
Could anyone give me some good ideas on how to implement this in a robust and modular way (I'm using Java) ?
Upvotes: 2
Views: 14102
Reputation: 1736
Log4j's LogFilePatternReceiver does exactly that...
This log entry: 17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file
can be parsed using the following logFormat (assuming origin is the same as 'logger'), with a timestampFormat leveraging Java's SimpleDateFormat: dd-MM-yyyy kk:mm:ss
TIMESTAMP LEVEL LOGGER - MESSAGE
The timezone and the level in the other form are a little trickier... there is the ability to remap strings to levels (E to ERROR), but I don't know that the timezone will quite work.
Try it out, check out the source, and play with support for it in the latest developer snapshot of Chainsaw:
http://people.apache.org/~sdeboy
Upvotes: 1
Reputation: 2031
If you have the possibility (and you should with a good logging framework), I would recommend duplicating the logs in a parsable format. For example, with log4j, use an XMLLayout or something similar. It will be a lot easier to parse because you will know the exact format of the logs.
You can do this quite transparently to the running app just by configuration. Think about using an asynchronous appender so as not to disturb the running application too much.
Also, if the XMLLayout suits your needs, have a look at Apache Chainsaw.
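A minimal log4j 1.2 configuration sketch of this setup, wiring an XMLLayout file appender behind an AsyncAppender; the appender names and file path are assumptions:

```xml
<!-- Sketch: duplicate logs into a machine-parsable XML file,
     decoupled from the app via an AsyncAppender. -->
<appender name="XML_FILE" class="org.apache.log4j.FileAppender">
  <param name="File" value="app-log.xml"/>
  <layout class="org.apache.log4j.xml.XMLLayout"/>
</appender>

<appender name="ASYNC" class="org.apache.log4j.AsyncAppender">
  <appender-ref ref="XML_FILE"/>
</appender>

<root>
  <priority value="debug"/>
  <appender-ref ref="ASYNC"/>
</root>
```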
Upvotes: 1
Reputation: 1424
Maybe you could write a Log4j CustomAppender? For example as described here: http://mytechattempts.wordpress.com/2011/05/10/log4j-custom-memory-appender/
Your custom appender could use a database, or simple Java objects queried via JMX, to gather your statistics. It all depends on how much data needs to be persisted.
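The counting idea can be sketched with the JDK's java.util.logging so it runs without log4j on the classpath; with log4j you would extend AppenderSkeleton and do the same counting in append(LoggingEvent). The class name and the level-keyed map are illustrative assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Sketch: an in-memory statistics handler that counts log records per level.
public class CountingHandler extends Handler {

    private final Map<String, AtomicLong> countsByLevel = new ConcurrentHashMap<>();

    @Override
    public void publish(LogRecord record) {
        countsByLevel
            .computeIfAbsent(record.getLevel().getName(), k -> new AtomicLong())
            .incrementAndGet();
    }

    /** Number of records seen so far at the given level name, e.g. "SEVERE". */
    public long count(String level) {
        AtomicLong c = countsByLevel.get(level);
        return c == null ? 0 : c.get();
    }

    @Override public void flush() { }
    @Override public void close() { }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setUseParentHandlers(false);
        CountingHandler stats = new CountingHandler();
        log.addHandler(stats);
        log.severe("Premature end of file");
        log.severe("Premature end of file");
        System.out.println(stats.count("SEVERE")); // prints 2
    }
}
```

Exposing the counts via a JMX MBean, as the answer suggests, would just be a getter over this map.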
Upvotes: 0
Reputation: 6149
You can use a Scanner, for example, and some regexes. Here is a snippet of what I did to parse some complex logs:
// DATE_PATTERN and EventLog come from the surrounding class (not shown here).
private static final Pattern LINE_PATTERN = Pattern.compile(
        "(\\S+:)?(\\S+? \\S+?) \\S+? DEBUG \\S+? - DEMANDE_ID=(\\d+?) - listener (\\S+?) : (\\S+?)");

public static EventLog parse(String line) throws ParseException {
    String demandeId;
    String listenerClass;
    long startTime;
    long endTime;
    SimpleDateFormat sdf = new SimpleDateFormat(DATE_PATTERN);
    Matcher matcher = LINE_PATTERN.matcher(line);
    if (matcher.matches()) {
        int offset = matcher.groupCount() - 4; // 4 interesting groups, the first is optional
        demandeId = matcher.group(2 + offset);
        listenerClass = matcher.group(3 + offset);
        long time = sdf.parse(matcher.group(1 + offset)).getTime();
        if ("starting".equals(matcher.group(4 + offset))) {
            startTime = time;
            endTime = -1;
        } else {
            startTime = -1;
            endTime = time;
        }
        return new EventLog(demandeId, listenerClass, startTime, endTime);
    }
    return null;
}
So, with regexes and groups, it works pretty well.
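For the multi-line messages the question mentions, one common approach (an assumption on my part, not part of the snippet above) is to group raw lines before matching: any line that does not look like the start of a new entry is treated as a continuation of the previous one, which naturally absorbs stacktraces. The entry-start pattern below assumes the "[11/17/11 ...]" timestamp format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: collapse raw lines into logical log entries so that a later
// line-level regex (like LINE_PATTERN above) sees one entry at a time.
public class MultiLineGrouper {

    // Assumed entry-start marker: "[MM/dd/yy ..." at the beginning of the line.
    private static final Pattern ENTRY_START =
            Pattern.compile("^\\[\\d{2}/\\d{2}/\\d{2} .*");

    public static List<String> group(List<String> rawLines) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = null;
        for (String line : rawLines) {
            if (ENTRY_START.matcher(line).matches() || current == null) {
                if (current != null) entries.add(current.toString());
                current = new StringBuilder(line); // new entry starts here
            } else {
                current.append('\n').append(line); // continuation: stacktrace etc.
            }
        }
        if (current != null) entries.add(current.toString());
        return entries;
    }
}
```

The same idea works for any format: only the entry-start pattern needs to change per log format.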
Upvotes: 2
Reputation: 9868
At work we rolled our own log parser (in Java) so we could filter the known stacktraces out of the production logs to identify new potential production problems. It uses regex and it's tightly coupled to our log4j log format.
We've also got a python script that runs over the live production transaction logs and reports (to SiteScope - our infrastructure monitoring tool) when the count for particular errors is too high.
While both are useful, they are awful to maintain, so I would recommend trying an open source log parsing tool first and resorting to writing your own only if necessary. Heck, I would even pay for a tool that did this ;)
Upvotes: 0
Reputation: 6532
AWStats is a great log parser, open source, and you can do whatever you want with the resulting database that it generates.
Upvotes: 3