Reputation: 103
I've got a multi-GB mail server log file and a list of ~350k message IDs. I want to pull out of the big log file the rows whose IDs appear on the long list... and I want it faster than it is now... Currently I do it in Perl:
#!/usr/bin/perl
use strict;
use warnings;

# open the file with the list - over 350k unique IDs
open ID, '<', 'maillog_id' or die "Cannot open maillog_id: $!";
my @lista_id = <ID>;
close ID;
chomp @lista_id;

open LOG, '<', 'maillog' or die "Cannot open maillog: $!";
# while - a foreach here would run out of memory
while ( <LOG> ) {
    my $wiersz = $_;
    my @wiersz_split = split ' ', $wiersz;
    foreach my $id ( @lista_id ) {
        # the message ID is the 6th column of the maillog
        if ( $wiersz_split[5] eq $id ) {
            # print the whole row when matched - can be STDOUT or a file or anything
            print "@wiersz_split\n";
        }
    }
}
close LOG;
It works, but it is slow... every line from the log is compared against the whole list of IDs. Should I use a database and perform a kind of join? Or compare substrings?
There are a lot of tools for log analysis - e.g. pflogsumm... but they just summarize. E.g. I could use
grep -c "status=sent" maillog
It would be fast but useless, and I would only use it AFTER filtering my log file... the same goes for pflogsumm etc. - it just increments counters.
Any suggestions?
-------------------- UPDATE -------------------
Thank you, Dallaylaen.
I succeeded with this (instead of the inner foreach over @lista_id):
if ( exists $lista_id_hash{$wiersz_split[5]} ) { print $wiersz; }
where %lista_id_hash is a hash whose keys are the items taken from my ID list. It works superfast.
Processing a 4.6 GB log file with >350k IDs takes less than a minute to filter out the interesting logs.
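For completeness, a minimal sketch of the whole fast version, assuming the same filenames and the ID in the 6th column as above:
#!/usr/bin/perl
use strict;
use warnings;

# build a lookup hash from the ID list - one key per unique ID
open ID, '<', 'maillog_id' or die "Cannot open maillog_id: $!";
my %lista_id_hash;
while ( my $id = <ID> ) {
    chomp $id;
    $lista_id_hash{$id} = 1;
}
close ID;

open LOG, '<', 'maillog' or die "Cannot open maillog: $!";
while ( my $wiersz = <LOG> ) {
    my @wiersz_split = split ' ', $wiersz;
    next unless defined $wiersz_split[5];
    # one O(1) hash lookup per line instead of scanning 350k IDs
    print $wiersz if exists $lista_id_hash{ $wiersz_split[5] };
}
close LOG;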
Upvotes: 0
Views: 217
Reputation: 5308
Use a hash.
my %known;
$known{$_} = 1 for @lista_id;
# ...
while (<>) {
    # ... determine $id for this line
    if ( $known{$id} ) {
        # process the line
    }
}
P.S. If your log is THAT big, you're probably better off splitting it according to e.g. the last two letters of $id into 256 (or 36**2?) smaller files. Something like a poor man's MapReduce. The number of IDs to store in memory at a time will also be reduced (i.e. when you're processing maillog.split.cf, you should only keep the IDs ending in "cf" in the hash).
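A rough sketch of that splitting pass (one possible reading; the maillog.split.XX filenames follow the example above, and the ID is assumed to sit in the 6th column as in the question):
#!/usr/bin/perl
use strict;
use warnings;

# one pass over the big log: bucket each line by the last two
# characters of its ID into maillog.split.XX files
my %fh;
open LOG, '<', 'maillog' or die "Cannot open maillog: $!";
while ( my $line = <LOG> ) {
    my $id = ( split ' ', $line )[5];
    next unless defined $id;
    my $bucket = lc substr( $id, -2 );
    unless ( $fh{$bucket} ) {
        # open each bucket file once and cache the handle
        open my $out, '>', "maillog.split.$bucket"
            or die "Cannot open maillog.split.$bucket: $!";
        $fh{$bucket} = $out;
    }
    print { $fh{$bucket} } $line;
}
close LOG;
close $_ for values %fh;
Splitting the ID list the same way then lets each maillog.split.XX file be filtered with only its matching subset of IDs held in the hash.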
Upvotes: 2