Reputation: 460
Suppose I've got a large text file from which I want to extract entries matching the following sample:
DATE TIME some uninteresting text,IDENTIFIER,COMMAND,ADDRESS,some uninteresting text
... some other lines in between ...
DATE TIME some uninteresting text DBCALL some uninteresting text IDENTIFIER some uninteresting text
... some other lines in between ...
DATE TIME some uninteresting text PARAM[1]=PARAM1 some uninteresting text IDENTIFIER some uninteresting text
...
sample with 2 entries:
2014-02-25 09:13:57.765 CET [----s-d] [TL] [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:17:17.086 CET [----s-d] [TL] [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
Known variables
COMMAND = "action_login1"
DBCALL = "call PKG1.Proc1"
PARAM1 (e.g. value of param[1]) = "loginname1"
Variables determined at runtime
IDENTIFIER (e.g. session id) = "W1OOKol6IF2DImfgVgJikUb" (example)
ADDRESS (e.g. IP address) = "10.1.1.1" (example)
DATE = "2014-02-25" (example)
TIME = "09:13:57.765" (example)
Variable used to locate related lines: IDENTIFIER (it appears in all 3 lines of a group)
Expected output
for each group of those 3 lines:
DATE TIME ADDRESS PARAM1 IDENTIFIER COMMAND
sample of the expected output (for the 2 sample records shown above):
2014-02-25 09:13:57.765 10.1.1.1 loginname1 W1OOKol6IF2DImfgVgJikUb action_login1
2014-02-25 09:17:17.086 10.1.1.1 loginname1 l3Na0H2bNOTv4AiaelSOS97 action_login1
Ordering and complexity
the shown order of these 3 lines is guaranteed
there are other lines between these 3; these 3 are just the ones important here
these 3 lines usually aren't very far from each other; usually they all fit into a 2-4 kB block (i.e. there is no need to search to the end of the file when no other related line is found within a few kB)
the input file can be very large and can't be fully read into memory
it isn't guaranteed that there will be no other entries (or even just their parts, only 1 or 2 lines) of the same type interleaved within an entry (the block between its 1st and 3rd line); something like the simple example below can happen:
2014-02-25 09:13:57.765 CET [----s-d] [TL] [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:17:17.086 CET [----s-d] [TL] [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]
Quantity
1st line: many (every such operation for every user has such an entry)
2nd line: many (same as the 1st)
3rd line: few (only a small fraction of all db calls will match)
the complete output can also be very large; we can't rely on being able to keep it fully in memory
Goal
In the worst case it can perhaps be simplified to involve just 2 lines instead of all 3; for better reliability the order would be slightly different.
I managed to get it (more or less) working (the simpler 2-line scenario) in Perl by reading the file in blocks and then doing a multi-line regex search using a back reference, e.g. with something like:
# read 4kB at a time
local $/ = \4096;
...
my $searchpattern = qr/(\d*-\d*-\d*\s\d*:\d*:\d*\.\d*).*?,(\w*?),$command,.*?param\[1\]=$param1.*?\[ID:(\2)\]/ms;
...
However, the problem is that some matches are missed: the ones that don't fit completely inside a single block that Perl reads and processes at a time. And as the file is very large, I can't read it whole into memory. Increasing the block size isn't a solution either, as some cases will always span multiple blocks, depending on where the block boundaries fall.
Does anyone have an idea how this can be solved efficiently (especially speed- and memory-wise)?
I also tried awk and sed, but couldn't get them to work properly because of their back-reference limitations in multi-line processing (I need to reference the same IDENTIFIER across lines); e.g. something based on this didn't work:
sed -n '/1st line pattern(match-group-1).../,/3rd line pattern\1.../p'
Because the back reference from the 1st pattern can't be used in the 2nd one. Moreover, sed would print even the entries I'm not interested in: once it finds the 1st line matching the beginning pattern, it prints everything until the ending pattern is found, and if the ending pattern is never found (yes, that can happen), it prints everything up to the end of the file. That's also something I don't want to happen.
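This range behaviour is easy to reproduce with a toy input (BEGIN and END here are stand-ins for the real patterns, not anything from the actual log):

```shell
# Once the start pattern matches, sed prints every line until the end
# pattern -- or to the end of input if the end pattern never appears again.
printf 'a\nBEGIN\nnoise\nEND\nb\nBEGIN\ntail\n' | sed -n '/BEGIN/,/END/p'
# prints: BEGIN noise END BEGIN tail (each on its own line)
```

Note how the second, incomplete group still prints through to the end of the input.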
EDIT: added better input sample, clarified description
Notes:
#!/bin/sh
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 loginname logfile" >&2
    exit 1
fi
# Note: the 3-argument form of match() is a gawk extension.
awk -v dbparam="$1" -v cmd="action_login1" -v dbcall="call PKG1.Proc1" '
$0 ~ ","cmd"," {
    match($0, /^([0-9]+-[0-9]+-[0-9]+) ([0-9]+:[0-9]+:[0-9]+\.[0-9]+)/, matches);
    date = matches[1];
    time = matches[2];
    # ERE has no non-greedy quantifiers, so match "anything but a comma" instead of .*?
    match($0, "DETAILS:[0-9]+,[0-9]+,([^,]+),"cmd",([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+),", matches);
    sessionid = matches[1];
    ipaddress = matches[2];
    seen_command = 1;
    seen_dbcall = 0;
}
seen_command && $0 ~ dbcall && $0 ~ "\\[DETAILS:"sessionid {
    seen_dbcall = 1;
}
seen_dbcall && $0 ~ "param\\[1\\]="dbparam && $0 ~ "\\[DETAILS:"sessionid {
    print date, time, ipaddress, sessionid, cmd;
    seen_command = 0;
    seen_dbcall = 0;
}
' "$2"
Upvotes: 1
Views: 316
Reputation: 24063
With Perl, you can do this in one pass through the file using a hash:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Regexp::Common qw(net time);
my $dbcall = 'call PKG1.Proc1';
my $command = 'action_login1';
my $param1 = 'loginname1';
# Time::Format-compatible pattern
my $date_format = 'yyyy-mm-dd hh:mm{in}:ss.mmm';
my $command_regex = qr/^($RE{time}{tf}{-pat => $date_format}).*\[DETAILS:\d+,\d+,(\w+),$command,($RE{net}{IPv4}),/;
my $dbcall_regex = qr/execute \{\Q$dbcall\E\(.*\) \} \[DETAILS:(\w+)\]/;
my $param1_regex = qr/param\[1\]=$param1 \[DETAILS:(\w+)\]/;
my %hash;
while (<DATA>) {
if (/$command_regex/) {
$hash{$2} = {
date => $1,
ip => $3
};
}
elsif (/$dbcall_regex/) {
$hash{$1}{seen} = 1;
}
elsif (/$param1_regex/) {
if (exists $hash{$1}{seen}) {
say join ' ', $hash{$1}{date}, $hash{$1}{ip}, $param1, $1, $command;
delete $hash{$1};
}
}
}
__DATA__
2014-02-25 09:13:57.765 CET [----s-d] [TL] [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:17:17.086 CET [----s-d] [TL] [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
Output:
2014-02-25 09:13:57.765 10.1.1.1 loginname1 W1OOKol6IF2DImfgVgJikUb action_login1
2014-02-25 09:17:17.086 10.1.1.1 loginname1 l3Na0H2bNOTv4AiaelSOS97 action_login1
Because the relative order of the lines is guaranteed, we can delete the corresponding entry from the hash as soon as we print it, keeping memory usage low.
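The same one-pass, delete-on-print idea carries over to awk (a sketch, not the code above: arrays keyed by the session id stand in for the Perl hash, the field positions are read off the sample lines in the question, and `logfile` is a placeholder):

```shell
awk '
index($0, ",action_login1,") {
    split($0, a, "DETAILS:")      # a[2] = "22,6,<id>,action_login1,<ip>,..."
    split(a[2], f, ",")
    id = f[3]
    date[id] = $1; tm[id] = $2; ip[id] = f[5]
    next
}
index($0, "execute {call PKG1.Proc1") {
    split($0, a, "DETAILS:"); sub(/\].*/, "", a[2])
    if (a[2] in date) seen[a[2]] = 1
    next
}
index($0, "param[1]=loginname1") {
    split($0, a, "DETAILS:"); sub(/\].*/, "", a[2])
    if (a[2] in seen) {
        print date[a[2]], tm[a[2]], ip[a[2]], "loginname1", a[2], "action_login1"
        # delete the finished entry, as in the Perl version
        delete date[a[2]]; delete tm[a[2]]; delete ip[a[2]]; delete seen[a[2]]
    }
}' logfile
```

Using index() for the fixed strings avoids having to escape the brackets and braces in an awk regex.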
Upvotes: 1
Reputation: 246764
I like awk for implementing this kind of state machine. Something like:
awk -v cmd="$command" -v param="$param1" -v dbcall="$dbcall" '
$0 ~ ","cmd"," {
    # pseudocode -- fill in the actual extraction for your line format
    datetime = parse_datetime_from_line()
    identifier = parse_identifier_from_line()
    address = parse_address_from_line()
    seen_command = 1
    seen_dbcall = 0
}
seen_command && $0 ~ dbcall {
    seen_dbcall = 1
}
seen_dbcall && $0 ~ param {
    print datetime, address, identifier, cmd
    seen_command = 0
    seen_dbcall = 0
}
' file
You don't describe the input file concretely enough, so extracting the important elements from each line is left as an exercise.
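For the sample lines later added to the question, the extraction could be filled in like this (a sketch; the DETAILS field positions are read off the sample, so adjust them if the real format differs):

```shell
echo '2014-02-25 09:13:57.765 CET [----s-d] [TL] [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a]' |
awk '{
    datetime = $1 " " $2              # first two whitespace-separated fields
    split($0, a, "DETAILS:")          # a[2] = "22,6,<id>,<cmd>,<ip>,..."
    split(a[2], f, ",")
    identifier = f[3]; address = f[5]
    print datetime, address, identifier
}'
# -> 2014-02-25 09:13:57.765 10.1.1.1 W1OOKol6IF2DImfgVgJikUb
```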
Upvotes: 1