Reputation: 103
I've got a multi-GB mail server log file and a list of ~350k message IDs. I want to pull out of the big log file the rows whose IDs appear on the long list... and I want it faster than it is now... Currently I do it in Perl:
#!/usr/bin/perl
use strict;
use warnings;

# open the file with the list - over 350k unique IDs
open ID, '<', 'maillog_id' or die "Cannot open maillog_id: $!";
my @lista_id = <ID>;
close ID;
chomp @lista_id;

open LOG, '<', 'maillog' or die "Cannot open maillog: $!";
# while - a foreach here would run out of memory
while ( <LOG> ) {
    my $wiersz = $_;
    my @wiersz_split = split ' ', $wiersz;
    foreach my $id ( @lista_id ) {
        # the message ID is the 6th column of the maillog
        if ( $wiersz_split[5] eq $id ) {
            # print the whole row when matched - can be STDOUT or a file or anything
            print "@wiersz_split\n";
        }
    }
}
close LOG;
It works, but it is slow... every line from the log is compared against the whole list of IDs. Should I use a database and perform a kind of join? Or compare substrings?
There are a lot of tools for log analysis - e.g. pflogsumm... but they just summarize. E.g. I could use
grep -c "status=sent" maillog
It would be fast but useless, and I would only use it AFTER filtering my log file... the same goes for pflogsumm etc. - it just increments counters.
Any suggestions?
-------------------- UPDATE -------------------
Thank you, Dallaylaen.
I succeeded with this (instead of the inner foreach over @lista_id):
if ( exists $lista_id_hash{$wiersz_split[5]} ) { print $wiersz; }
where %lista_id_hash is a hash whose keys are the items taken from my ID list. It works superfast.
Processing a 4.6 GB log file with >350k IDs takes less than a minute to filter out the interesting logs.
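For completeness, a minimal sketch of the whole fast version, assuming the same filenames and the ID in the 6th column as above:
#!/usr/bin/perl
use strict;
use warnings;

# build a lookup hash from the ID list - one key per unique ID
open ID, '<', 'maillog_id' or die "Cannot open maillog_id: $!";
my %lista_id_hash;
while ( my $id = <ID> ) {
    chomp $id;
    $lista_id_hash{$id} = 1;
}
close ID;

open LOG, '<', 'maillog' or die "Cannot open maillog: $!";
while ( my $wiersz = <LOG> ) {
    my @wiersz_split = split ' ', $wiersz;
    next unless defined $wiersz_split[5];
    # one O(1) hash lookup per line instead of scanning 350k IDs
    print $wiersz if exists $lista_id_hash{ $wiersz_split[5] };
}
close LOG;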
Upvotes: 0
Views: 217
Reputation: 5308
Use a hash.
my %known;
$known{$_} = 1 for @lista_id;
# ...
while (<>) {
    # ... determine $id for this line
    if ( $known{$id} ) {
        # process the line
    }
}
P.S. If your log is THAT big, you're probably better off splitting it according to e.g. the last two letters of $id into 256 (or 36**2?) smaller files. Something like a poor man's MapReduce. The number of IDs to store in memory at a time will also be reduced (i.e. when you're processing maillog.split.cf, you should only keep the IDs ending in "cf" in the hash).
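A rough sketch of that splitting pass (one possible reading; the maillog.split.XX filenames follow the example above, and the ID is assumed to sit in the 6th column as in the question):
#!/usr/bin/perl
use strict;
use warnings;

# one pass over the big log: bucket each line by the last two
# characters of its ID into maillog.split.XX files
my %fh;
open LOG, '<', 'maillog' or die "Cannot open maillog: $!";
while ( my $line = <LOG> ) {
    my $id = ( split ' ', $line )[5];
    next unless defined $id;
    my $bucket = lc substr( $id, -2 );
    unless ( $fh{$bucket} ) {
        # open each bucket file once and cache the handle
        open my $out, '>', "maillog.split.$bucket"
            or die "Cannot open maillog.split.$bucket: $!";
        $fh{$bucket} = $out;
    }
    print { $fh{$bucket} } $line;
}
close LOG;
close $_ for values %fh;
Splitting the ID list the same way then lets each maillog.split.XX file be filtered with only its matching subset of IDs held in the hash.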
Upvotes: 2