Reputation: 911
I have a log file that contains millions (2-4 million) of lines, each carrying special tokens such as IPs, ports, email IDs, domains, PIDs, etc.
I need to parse and normalize the file so that every such token is replaced by a constant string like IP, PORT, EMAIL, DOMAIN, etc., and then report the count of all duplicate lines.
For example, given a file with content like this:
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
the normalized output would be:
MONTH DAY TIME user IP is not reachable =======> Count = 14
A log line can contain several kinds of tokens to search and replace, such as domains and email IDs. The code below takes about 16 minutes for a 10 MB log file (mail server logs were used).
Is it possible to reduce that time in Perl when you have to process that many lines, each with several regex and substitution operations to perform?
The code I have written is:
use strict;
use warnings;
use Tie::Hash::Sorted;
use Getopt::Long;
use Regexp::Common qw(net URI Email::Address);
use Email::Address;

my $ignore    = 0;
my $threshold = 0;
my $normalize = 0;
GetOptions(
    'ignore=i'    => \$ignore,    # number of leading fields to skip
    'threshold=i' => \$threshold,
    'normalize=i' => \$normalize,
);

my ( %initial_log, %Logs, %final_logs );
my ( $total_lines, $threshold_value );

my $file = shift or die "Usage: $0 FILE\n";
open my $fh, '<', $file or die "Could not open '$file': $!";

# Sort the results according to frequency
my $sort_by_numeric_value = sub {
    my $hash = shift;
    [ sort { $hash->{$b} <=> $hash->{$a} } keys %$hash ];
};
# Skip the first "ignore" whitespace-separated fields of each line
while ( my $line = <$fh> ) {
    my $skip_words = $ignore;
    chomp $line;
    $total_lines++;
    if ($ignore) {
        my @arr = split /\s+/smx, $line;
        shift @arr while $skip_words-- != 0;
        $line = join ' ', @arr;
    }
    $initial_log{$line}++;
}
close $fh or die "unable to close: $!";

$threshold_value = int( ( $total_lines / 100 ) * $threshold );
tie my %sorted_init_logs, 'Tie::Hash::Sorted',
    'Hash'         => \%initial_log,
    'Sort_Routine' => $sort_by_numeric_value;

%final_logs = %sorted_init_logs;

if ($normalize) {
    # Normalize the logs
    while ( my ( $line, $count ) = each %final_logs ) {
        $line = normalize($line);
        $Logs{$line} += $count;
    }
    %final_logs = %Logs;
}

tie my %sorted_logs, 'Tie::Hash::Sorted',
    'Hash'         => \%final_logs,
    'Sort_Routine' => $sort_by_numeric_value;

my $reduced_lines = scalar values %final_logs;
my $reduction     = int( 100 - ( ( $reduced_lines / $total_lines ) * 100 ) );

print "Number of lines in the original logs   = $total_lines\n";
print "Number of lines in the normalized logs = $reduced_lines\n";
print "Logs reduced after normalization       = $reduction%\n";

# Show only the logs whose count is at or above the threshold value
while ( my ( $log, $count ) = each %sorted_logs ) {
    if ( $count >= $threshold_value ) {
        printf "%-80s ===========> [%s]\n", $log, $count;
    }
}
sub normalize {
    my $input = shift;

    # Remove unwanted characters
    $input =~ s/[()]//smxg;

    # Normalize the URIs
    $input =~ s/$RE{URI}{HTTP}/URI/smxg;

    # Normalize the IP addresses (and a port following an IP)
    $input =~ s/$RE{net}{IPv4}/IP/smxg;
    $input =~ s/IP(\W+)\d+/IP$1PORT/smxg;
    $input =~ s/$RE{net}{IPv4}{hex}/HEX_IP/smxg;
    $input =~ s/$RE{net}{IPv4}{bin}/BINARY_IP/smxg;
    $input =~ s/\b$RE{net}{MAC}\b/MAC/smxg;

    # Normalize the email addresses
    $input =~ s/(\w+)=$RE{Email}{Address}/$1=EMAIL/smxg;
    $input =~ s/$RE{Email}{Address}/EMAIL/smxg;

    # Normalize the domain names
    $input =~ s/[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(?:\.[A-Za-z]{2,})/HOSTNAME/smxg;

    return $input;
}
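To make the question more concrete: one change I am wondering about is compiling the Regexp::Common patterns once with qr// instead of interpolating $RE{...} inside every substitution. Below is a minimal, untested sketch of that idea (only the IP and email rules are shown, and normalize_fast() is just an illustrative name):

use strict;
use warnings;
use Regexp::Common qw(net Email::Address);
use Email::Address;

# Compile the patterns once up front instead of interpolating
# $RE{...} in every s/// call (untested sketch, not benchmarked).
my $ip_re    = qr/$RE{net}{IPv4}/;
my $email_re = qr/$RE{Email}{Address}/;

sub normalize_fast {
    my $input = shift;
    $input =~ s/$ip_re/IP/g;
    $input =~ s/$email_re/EMAIL/g;
    return $input;
}

print normalize_fast('Aug 19 10:22:48 user 10.1.1.1 is not reachable'), "\n";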
Upvotes: 3
Views: 241
Reputation: 118166
Especially if you do not know the exact types of queries you'll need to perform, you would be much better off putting the parsed log data into an SQLite database. The following example illustrates this using a temporary in-memory database. If you want to run multiple different queries against the same data, parse once, load it into the database, then query to your heart's content. This ought to be faster than what you are doing right now, but obviously I haven't measured anything:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite::memory:', undef, undef,
    {
        RaiseError => 1,
        AutoCommit => 0,
    }
);
$dbh->do(q{
    CREATE TABLE 'status' (
        id integer primary key,
        month char(3),
        day char(2),
        time char(8),
        agent varchar(100),
        ip char(15),
        status varchar(100)
    )
});
$dbh->commit;
my @cols = qw(month day time agent ip status);
my $inserter = $dbh->prepare(sprintf
    q{INSERT INTO 'status' (%s) VALUES (%s)},
    join(',', @cols),
    join(',', ('?') x @cols)
);

while (my $line = <DATA>) {
    $line =~ s/\s+\z//;
    $inserter->execute(split ' ', $line, scalar @cols);
}
$dbh->commit;
my $summarizer = $dbh->prepare(q{
    SELECT
        month,
        day,
        time,
        agent,
        ip,
        status,
        count(*) as count
    FROM status
    GROUP BY month, day, time, agent, ip, status
});
$summarizer->execute;
my $result = $summarizer->fetchall_arrayref;
print "@$_\n" for @$result;
$dbh->disconnect;
__DATA__
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Output:
Aug 19 10:22:48 user 10.1.1.1 is not reachable 4
Aug 19 10:22:48 user 10.1.1.4 is not reachable 3
Aug 19 10:22:48 user 10.1.1.5 is not reachable 1
Aug 19 10:22:48 user 10.1.1.6 is not reachable 5
Aug 19 10:22:48 user 10.1.3.1 is not reachable 1
Aug 19 10:22:48 user 10.1.4.1 is not reachable 1
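And because the rows are already in a database, you can ask a different question of the same data without re-parsing the file. As a hypothetical follow-up (it reuses $dbh and the status table from the script above, so it would have to run before the disconnect), this lists only the IPs reported unreachable at least three times, most frequent first:

# Hypothetical follow-up query, reusing $dbh and the 'status' table above:
# group by IP only, keep groups seen at least 3 times, most frequent first.
my $frequent = $dbh->prepare(q{
    SELECT ip, count(*) AS count
    FROM status
    GROUP BY ip
    HAVING count(*) >= 3
    ORDER BY count DESC
});
$frequent->execute;
while (my $row = $frequent->fetchrow_arrayref) {
    my ($ip, $count) = @$row;
    print "$ip ===========> [$count]\n";
}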
Upvotes: 1