Reputation: 85
I need to grep full stack traces from a log file by keyword.
This code works, but it is too slow on big files (the bigger the file, the slower it gets). I think the best approach is to improve the regex that finds the keyword, but I could not get that done.
#!/usr/bin/perl
use strict;
use warnings;
my $regexp;
my $stacktrace;
undef $/;
$regexp = shift;
$regexp = quotemeta($regexp);
while (<>) {
    while ( $_ =~ /(?<LEVEL>^[E|W|D|I])\s
                   (?<TIMESTAMP>\d{6}\s\d{6}\.\d{3})\s
                   (?<THREAD>.*?)\/
                   (?<CLASS>.*?)\s-\s
                   (?<MESSAGE>.*?[\r|\n](?=^[[E|W|D|I]\s\d{6}\s\d{6}\.\d{3}]?))/gsmx ) {
        $stacktrace = $&;
        if ( $+{MESSAGE} =~ /$regexp/ ) {
            print "$stacktrace";
        }
    }
}
Usage: ./grep_log4j.pl <pattern> <file>
Example: ./grep_log4j.pl Exception sample.log
I think the problem is in $stacktrace = $&;
because if I remove this line and simply print all matching lines, the script runs fast.
Version of the script that prints all matches:
#!/usr/bin/perl
use strict;
use warnings;
undef $/;
while (<>) {
    while ( $_ =~ /(?<LEVEL>^[E|W|D|I])\s
                   (?<TIMESTAMP>\d{6}\s\d{6}\.\d{3})\s
                   (?<THREAD>.*?)\/
                   (?<CLASS>.*?)\s-\s
                   (?<MESSAGE>.*?[\r|\n](?=^[[E|W|D|I]\s\d{6}\s\d{6}\.\d{3}]?))/gsmx ) {
        print_result();
    }
}
sub print_result {
    print "LEVEL: $+{LEVEL}\n";
    print "TIMESTAMP: $+{TIMESTAMP}\n";
    print "THREAD: $+{THREAD}\n";
    print "CLASS: $+{CLASS}\n";
    print "MESSAGE: $+{MESSAGE}\n";
}
Usage: ./grep_log4j.pl <file>
Example: ./grep_log4j.pl sample.log
Log4j pattern: %-1p %d %t/%c{1} - %m%n
Example of logfile:
I 111012 141506.000 thread/class - Received message: something
E 111012 141606.000 thread/class - Failed handling mobile request
java.lang.NullPointerException
at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
at java.lang.Thread.run(Thread.java:619)
W 111012 141706.000 thread/class - Received message: something
E 111012 141806.000 thread/class - Failed with Exception
java.lang.NullPointerException
at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
at java.lang.Thread.run(Thread.java:619)
D 111012 141906.000 thread/class - Received message: something
S 111012 142006.000 thread/class - Received message: something
I 111012 142106.000 thread/class - Received message: something
I 111013 142206.000 thread/class - Metrics:0/1
You can find my regex on http://gskinner.com/RegExr/ by searching for the log4j keyword.
Upvotes: 4
Views: 1322
Reputation: 52049
You are using:
$/ = undef;
This makes perl read the entire file into memory.
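To make the effect of $/ concrete, here is a minimal sketch that reads the same data once in slurp mode and once in the default line mode. The two-line log string and the in-memory filehandle are assumptions made only so the example runs without a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-entry log held in a string so the sketch needs no file.
my $log = "I 111012 141506.000 thread/class - first\n"
        . "E 111012 141606.000 thread/class - second\n";

# Slurp mode: with $/ undefined, a single read returns the whole input.
my @slurped = do {
    local $/;                                   # undef only inside this block
    open my $in, '<', \$log or die "open: $!";
    <$in>;
};

# Default mode: $/ is "\n", so each read returns one line.
open my $fh, '<', \$log or die "open: $!";
my @lines = <$fh>;
close $fh;

printf "slurp mode records: %d\n", scalar @slurped;   # 1
printf "line mode records:  %d\n", scalar @lines;     # 2
```

In slurp mode the single record is as large as the file, so memory use grows with the log size; in line mode it stays bounded by the longest line.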
I would process this file line-by-line like this (assuming the stack trace is associated with the message above the trace):
my $matched;
while (<>) {
    if (m/^(?<LEVEL>\S+) \s+ (?<TIMESTAMP>(\d+) \s+ ([\d.]+)) \s+ (?<THREADCLASS>\S+) \s+ - \s+ (?<REST>.*)/x) {
        my %captures = %+;    # copy now: the next match overwrites %+
        $matched = ($+{REST} =~ $regexp);
        if ($matched) {
            print "LEVEL: $captures{LEVEL}\n";
            ...
        }
    } elsif ($matched) {
        print;                # continuation line of a matched entry
    }
}
Here is a general technique for parsing multi-line blocks.
The following loop reads STDIN one line at a time and feeds complete blocks of the log file to the subroutine process:
my $first;
my $stack = "";
while (<STDIN>) {
    if (m/^\S /) {
        process($first, $stack) if $first;
        $first = $_;
        $stack = "";
    } else {
        $stack .= $_;
    }
}
process($first, $stack) if $first;
sub process {
    my ($first, $stack) = @_;
    # ... do whatever you want here ...
}
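As one possible way to fill in the block handler for the original task (printing whole entries whose message or trace contains a keyword), the sketch below wraps the same loop in a function. The names grep_entries and $process, the header pattern with the level letters EWDIS, and the in-memory sample are assumptions for illustration, not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns every log entry (first line plus stack trace) whose message
# or trace matches $regexp. Reads $fh one line at a time.
sub grep_entries {
    my ($fh, $regexp) = @_;
    my @found;
    my ($first, $stack);

    # Hypothetical helper: keep the finished block if it matches.
    my $process = sub {
        return unless defined $first;
        # The message is everything after " - " on the first line.
        my ($message) = $first =~ /\s-\s(.*)/s;
        $message = $first unless defined $message;
        push @found, $first . $stack if ($message . $stack) =~ $regexp;
    };

    while (my $line = <$fh>) {
        # Assumed entry header: level letter, date, time with milliseconds.
        if ($line =~ /^[EWDIS]\s\d{6}\s\d{6}\.\d{3}\s/) {
            $process->();
            ($first, $stack) = ($line, "");
        } elsif (defined $first) {
            $stack .= $line;    # continuation line, e.g. a stack frame
        }
    }
    $process->();               # flush the last block
    return @found;
}

# Sample data taken from the question, read through an in-memory filehandle.
my $sample = <<'END';
I 111012 141506.000 thread/class - Received message: something
E 111012 141606.000 thread/class - Failed handling mobile request
java.lang.NullPointerException
at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
at java.lang.Thread.run(Thread.java:619)
W 111012 141706.000 thread/class - Received message: something
END

open my $fh, '<', \$sample or die "open: $!";
my $keyword = quotemeta 'Exception';    # literal match, as in the question
my @matches = grep_entries($fh, qr/$keyword/);
close $fh;
print @matches;                         # prints the E entry with its trace
```

Because each block is assembled from whole lines, this version never needs $& and holds only one entry in memory at a time.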
Upvotes: 1
Reputation: 2710
The problem is the misuse of [] in your regexp.
[...] is for defining character classes
(...) is for grouping
All you need is to change [E|W|D|I] to [EWDI] everywhere and not use [] for grouping in MESSAGE.
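A short check of the difference, as a sketch: inside a character class the pipe is an ordinary character, so [E|W|D|I] also accepts a literal "|". The three test characters are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @candidates = ('E', '|', 'S');

# Inside [...] the pipe is a literal character, not alternation.
my @pipe_hits  = grep { /^[E|W|D|I]$/ } @candidates;
my @plain_hits = grep { /^[EWDI]$/ }    @candidates;

print "with pipes:    @pipe_hits\n";    # E |
print "without pipes: @plain_hits\n";   # E
```

Neither class matches "S", which is why the level class in a corrected regex also needs that letter for the sample log.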
Here's the final code that works for me:
#!/usr/bin/perl
use strict;
use warnings;
undef $/;
while (<>) {
    while (
        $_ =~ /(?<LEVEL>^[EWDIS])\s
               (?<TIMESTAMP>\d{6}\s\d{6}\.\d{3})\s
               (?<THREAD>.*?)\/
               (?<CLASS>.*?)\s-\s
               (?<MESSAGE>.*?[\r\n](?=[EWDIS]\s\d{6}\s\d{6}\.\d{3}|$))/gmxs
      )
    {
        print_result();
    }
}
sub print_result {
    print "LEVEL: $+{LEVEL}\n";
    print "TIMESTAMP: $+{TIMESTAMP}\n";
    print "THREAD: $+{THREAD}\n";
    print "CLASS: $+{CLASS}\n";
    print "MESSAGE: $+{MESSAGE}\n";
}
Note that you missed the letter 'S' in the list of level flags.
This example may also contain errors, but it works in general.
Upvotes: 0