Niranjan Subramanian

Reputation: 591

Need a Regex for parsing Apache files

I need a regex for parsing Apache log files.

For example, here is a portion of /var/log/httpd/error_log:

[Sun Sep 02 03:34:01 2012] [notice] Digest: done
[Sun Sep 02 03:34:01 2012] [notice] Apache/2.2.15 (Unix) DAV/2 mod_ssl/2.2.15 OpenSSL/1.0.0- fips SVN/1.6.11 configured -- resuming normal operations
[Sun Sep 02 03:34:01 2012] [error] avahi_entry_group_add_service_strlst("localhost") failed: Invalid host name
[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] File does not exist: /var/www/html/manager
[Sun Sep 02 11:04:35 2012] [error] [client 58.218.199.250] File does not exist: /var/www/html/proxy

I want a regex that uses a space as the field delimiter but does not split on spaces embedded within a field. The Apache error log format alternates between these two layouts:

[DAY MMM DD HH:MM:SS YYYY] [MSG_TYPE] DESCRIPTOR: MESSAGE

[DAY MMM DD HH:MM:SS YYYY] [MSG_TYPE] [SOURCE IP] ERROR: DETAIL

I created two regexes. The first one is:

^(\[[\w:\s]+\]) (\[[\w]+\]) (\[[\w\d.\s]+\])?([\w\s/.(")-]+[\-:]) ([\w/\s]+)$

This one is simple and just matches the contents as they are.

I want something like the following regex, which I also wrote:

      (?<=|\s)([\w:\S]+)

This one doesn't give me the desired output: it splits on embedded spaces instead of keeping them inside a field. So I need a regex that groups each field, keeps embedded spaces, and uses a space as the delimiter. Please help me out with the logic!

My code:
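For context on why the second pattern behaves this way: the class `[\w:\S]` is equivalent to plain `\S`, and the lookbehind `(?<=|\s)` has an empty alternative that always succeeds, so the pattern matches every run of non-space characters. The following is a minimal sketch, not a full parser, of a tokenizer that keeps a bracketed field together as one token. The class name `BracketTokens` and the method `tokens` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketTokens {
    // Either a whole "[...]" field (group 1, embedded spaces kept)
    // or a run of non-space characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\\[([^\\]]*)\\]|(\\S+)");

    // Hypothetical helper: returns the tokens of one log line, brackets stripped.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] "
                + "File does not exist: /var/www/html/manager";
        for (String token : tokens(line)) {
            System.out.println(token);
        }
    }
}
```

On the sample line this yields the three bracketed fields as single tokens (`Sun Sep 02 08:01:14 2012`, `error`, `client 216.244.73.194`), but the free-text tail is still split word by word (`File`, `does`, `not`, `exist:`, then the path). A space alone cannot delimit that part; it needs a different separator, such as the first `": "`.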

import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

void regexparser(CharBuffer cb) {
    // Matches one line at a time, including its line terminator.
    Pattern linePattern = Pattern.compile(".*\r?\n");
    // Groups: 1 = timestamp, 2 = message type, 3 = optional "[source ip]",
    // 4 = contents of group 3, 5 = descriptor or error, 6 = message or detail.
    Pattern csvpat = Pattern.compile( "^\\[([\\w:\\s]+)\\] \\[([\\w]+)\\] (\\[([\\w\\d.\\s]+)\\])?([\\w\\s/.(\")-]+[\\-:]) ([\\w/\\s].+)",Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
    Matcher lm = linePattern.matcher(cb);
    Matcher pm = null;

    while (lm.find()) {
        CharSequence cs = lm.group();

        // Reuse one field matcher across lines.
        if (pm == null)
            pm = csvpat.matcher(cs);
        else
            pm.reset(cs);

        while (pm.find()) {
            // Group 4 is null when the line has no "[source ip]" field.
            if (pm.group(4) == null)
                System.out.println(pm.group(1)+" "+pm.group(2)+" "+pm.group(5)+" "+pm.group(6));
            else
                System.out.println(pm.group(1)+" "+pm.group(2)+" "+pm.group(4)+" "+pm.group(5)+" "+pm.group(6));
        }
    }
}

Upvotes: 0

Views: 701

Answers (1)

memowe

Reputation: 2668

I agree that this task should be done with an existing solution to parse Apache logs.

However, if you want to try something out for training purposes, you could start with this. Instead of parsing everything with one huge regex, I do it in small steps that are much more readable:

Code

#!/usr/bin/env perl

use strict;
use warnings;
use DateTime::Format::Strptime;
use feature 'say';

# date parser, created once and reused for every line
my $strp = DateTime::Format::Strptime->new(
    pattern => '%a %b %d %H:%M:%S %Y',
);

# iterate log lines
while (defined(my $line = <DATA>)) {
    chomp $line;

    # prepare
    my %data;

    # consume date/time
    next unless $line =~ s/^\[(\w+ \w+ \d+ \d\d:\d\d:\d\d \d{4})\] //;
    $data{date} = $strp->parse_datetime($1);

    # consume message type
    next unless $line =~ s/^\[(\w+)\] //;
    $data{type} = $1;

    # "[source ip]" alternative
    if ($line =~ s/^\[(\w+) ([\d\.]+)\] //) {
        @data{qw(source ip)} = ($1, $2);

        # consume "error: detail"
        next unless $line =~ s/^([^:]+): (.*)//;
        @data{qw(error detail)} = ($1, $2);
    }

    # "descriptor: message" alternative
    elsif ($line =~ s/^([^:]+): (.*)//) {
        @data{qw(descriptor message)} = ($1, $2);
    }

    # invalid
    else {
        next;
    }

    # something left: invalid
    next if length $line;

    # parsed ok: output
    say "$_: $data{$_}" for keys %data;
    say '-' x 40;
}

__DATA__
[Sun Sep 02 03:34:01 2012] [notice] Digest: done
[Sun Sep 02 03:34:01 2012] [notice] Apache/2.2.15 (Unix) DAV/2 mod_ssl/2.2.15 OpenSSL/1.0.0- fips SVN/1.6.11 configured -- resuming normal operations
[Sun Sep 02 03:34:01 2012] [error] avahi_entry_group_add_service_strlst("localhost") failed: Invalid host name
[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] File does not exist: /var/www/html/manager
[Sun Sep 02 11:04:35 2012] [error] [client 58.218.199.250] File does not exist: /var/www/html/proxy

Output

descriptor: Digest
date: 2012-09-02T03:34:01
type: notice
message: done
----------------------------------------
descriptor: avahi_entry_group_add_service_strlst("localhost") failed
date: 2012-09-02T03:34:01
type: error
message: Invalid host name
----------------------------------------
detail: /var/www/html/manager
source: client
ip: 216.244.73.194
date: 2012-09-02T08:01:14
error: File does not exist
type: error
----------------------------------------
detail: /var/www/html/proxy
source: client
ip: 58.218.199.250
date: 2012-09-02T11:04:35
error: File does not exist
type: error
----------------------------------------

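Note that according to your format description, the second line is invalid and is ignored by the program: after the date and message type are consumed, the rest of that line contains no colon, so it matches neither "DESCRIPTOR: MESSAGE" nor "[SOURCE IP] ERROR: DETAIL".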
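Since the question's code is Java, here is a rough sketch of the same stepwise idea in that language. It is an adaptation, not a line-by-line port: the class name `StepwiseErrorLog` and the method `parse` are made up for illustration, and the date is kept as a plain string instead of being parsed into a date object.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StepwiseErrorLog {
    // One small pattern per step, each anchored at the start of what is left.
    private static final Pattern DATE =
            Pattern.compile("^\\[(\\w+ \\w+ \\d+ \\d\\d:\\d\\d:\\d\\d \\d{4})\\] ");
    private static final Pattern TYPE = Pattern.compile("^\\[(\\w+)\\] ");
    private static final Pattern SOURCE = Pattern.compile("^\\[(\\w+) ([\\d.]+)\\] ");
    private static final Pattern PAIR = Pattern.compile("^([^:]+): (.*)$");

    // Returns the parsed fields in insertion order, or null if the line fits neither layout.
    static Map<String, String> parse(String line) {
        Map<String, String> data = new LinkedHashMap<>();
        String rest = line;

        // consume date/time
        Matcher m = DATE.matcher(rest);
        if (!m.find()) return null;
        data.put("date", m.group(1));
        rest = rest.substring(m.end());

        // consume message type
        m = TYPE.matcher(rest);
        if (!m.find()) return null;
        data.put("type", m.group(1));
        rest = rest.substring(m.end());

        // optional "[source ip]" field
        boolean hasSource = false;
        m = SOURCE.matcher(rest);
        if (m.find()) {
            data.put("source", m.group(1));
            data.put("ip", m.group(2));
            rest = rest.substring(m.end());
            hasSource = true;
        }

        // the remainder must be "name: text", split at the first colon
        m = PAIR.matcher(rest);
        if (!m.matches()) return null;
        data.put(hasSource ? "error" : "descriptor", m.group(1));
        data.put(hasSource ? "detail" : "message", m.group(2));
        return data;
    }

    public static void main(String[] args) {
        String[] lines = {
            "[Sun Sep 02 03:34:01 2012] [notice] Digest: done",
            "[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] File does not exist: /var/www/html/manager",
        };
        for (String line : lines) {
            Map<String, String> data = parse(line);
            if (data == null) continue; // invalid line: skip it
            data.forEach((k, v) -> System.out.println(k + ": " + v));
            System.out.println("----------------------------------------");
        }
    }
}
```

Because a `LinkedHashMap` keeps insertion order, the fields print in the order they were consumed (date, type, then source and ip if present, then the final pair), unlike the unordered hash in the Perl version.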

Upvotes: 1
