Petr H

Reputation: 460

multi-line search with back reference in large file

Suppose I've got a large text file from which I want to get entries based on the following sample:

DATE TIME some uninteresting text,IDENTIFIER,COMMAND,ADDRESS,some uninteresting text
... some other lines in between ...
DATE TIME some uninteresting text DBCALL some uninteresting text IDENTIFIER some uninteresting text
... some other lines in between ...
DATE TIME some uninteresting text PARAM[1]=PARAM1 some uninteresting text IDENTIFIER some uninteresting text
...

sample with 2 entries:

2014-02-25 09:13:57.765 CET [----s-d] [TL]  [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]

2014-02-25 09:17:17.086 CET [----s-d] [TL]  [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]

Known variables: COMMAND, DBCALL and PARAM1 (supplied up front; in the sample they are action_login1, call PKG1.Proc1 and loginname1).

Variables determined at runtime: DATE, TIME, IDENTIFIER and ADDRESS.

Variable used to locate related lines: IDENTIFIER.

Expected output

For each group of those 3 lines, one output line with DATE TIME ADDRESS PARAM1 IDENTIFIER COMMAND.

Sample of the expected output (for the 2 sample records shown above):

2014-02-25 09:13:57.765 10.1.1.1 loginname1 W1OOKol6IF2DImfgVgJikUb action_login1
2014-02-25 09:17:17.086 10.1.1.1 loginname1 l3Na0H2bNOTv4AiaelSOS97 action_login1

Ordering and complexity

The lines belonging to one entry are not necessarily adjacent, and lines for different IDENTIFIERs can be interleaved (they do not even have to appear in chronological order), as in this sample:

2014-02-25 09:13:57.765 CET [----s-d] [TL]  [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:17:17.086 CET [----s-d] [TL]  [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]

Quantity: the log file is very large, too large to be read into memory as a whole.

Goal

  1. Search for the line with the appropriate DBCALL value (2nd line in the example), get its IDENTIFIER
  2. For that IDENTIFIER locate the nearest line (somewhere below, usually it's just the following line but not always) with the appropriate PARAM[1]=PARAM1 (3rd line in the example). If nothing is found then abort this cycle and continue searching the rest of the file at step 1.
  3. For that IDENTIFIER locate the nearest line (somewhere above) with the appropriate COMMAND (1st line in the example). If nothing is found then abort this cycle and continue searching the rest of the file at step 1.
  4. Print DATE, TIME, ADDRESS, IDENTIFIER, COMMAND for that line (1st line from the example); see the naive sketch right after this list.
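
To make the goal concrete, here is a deliberately naive sketch of those four steps in Perl. It slurps the whole file into memory, which is exactly what I cannot afford to do with the real log, and the search values are just the ones from the sample; it is only meant to pin down the intended behaviour.

#!/usr/bin/perl
use strict;
use warnings;

# Naive illustration of the 4 steps above: it slurps the whole file
# (not viable for the real log) and the search values are the sample ones.
my ($command, $dbcall, $param1) = ('action_login1', 'call PKG1.Proc1', 'loginname1');

open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
my @lines = <$fh>;
close $fh;

for my $i (0 .. $#lines) {
    # step 1: DBCALL line, remember its IDENTIFIER
    next unless $lines[$i] =~ /execute \{\Q$dbcall\E\(.*\) \} \[DETAILS:(\w+)\]/;
    my $id = $1;

    # step 2: is there a param[1]=PARAM1 line further down with the same IDENTIFIER?
    next unless grep { /param\[1\]=\Q$param1\E \[DETAILS:\Q$id\E\]/ } @lines[$i + 1 .. $#lines];

    # step 3: nearest COMMAND line above with the same IDENTIFIER
    for my $j (reverse 0 .. $i - 1) {
        next unless $lines[$j] =~ /^(\S+ \S+) .*\[DETAILS:\d+,\d+,\Q$id\E,\Q$command\E,([\d.]+),/;
        # step 4: DATE, TIME, ADDRESS, PARAM1, IDENTIFIER, COMMAND
        print "$1 $2 $param1 $id $command\n";
        last;
    }
}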

In the worst case this can be simplified to involve just 2 of the 3 lines; for better reliability the order of the steps would then be slightly different:

  1. Search for the line with the appropriate COMMAND (1st line in the example), get its IDENTIFIER.
  2. For that IDENTIFIER locate the nearest line (somewhere below) with the appropriate PARAM[1]=PARAM1 value (2nd line in the example). If nothing is found then abort this cycle and continue searching the rest of the file at step 1.
  3. Print DATE, TIME, ADDRESS, IDENTIFIER, COMMAND for that line (1st line from the example).

I managed to get it (more or less) working for the simpler 2nd scenario in Perl by reading the file in blocks and then doing a multi-line regex search with a back reference, e.g. with something like:

# read 4kB at a time
local $/ = \4096;
...
my $searchpattern = qr/(\d*-\d*-\d*\s\d*:\d*:\d*\.\d*).*?,(\w*?),$command,.*?param\[1\]=$param1.*?\[DETAILS:\2\]/ms;
...

However, the problem is that some matches are missed: the ones that don't fit completely inside a single block that Perl reads and processes at a time. And as the file is very large, I can't read it into memory as a whole. Increasing the block size isn't a solution either, because there will always be cases that span multiple blocks, depending on where the block boundaries happen to fall.
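
One workaround I considered is carrying a tail of each block over into the next read, so that a group of related lines can never be cut by a boundary smaller than the overlap. A rough, untested sketch (the 64 kB block and 8 kB overlap sizes are arbitrary guesses; the overlap would have to be longer than the longest group of related lines, and matches found in the re-scanned tail have to be de-duplicated):

#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch of the overlap idea; sizes are guesses, and the pattern is
# the 2-line variant (COMMAND line, then its param[1] line somewhere below).
my ($command, $param1) = ('action_login1', 'loginname1');
my $searchpattern = qr/(\d+-\d+-\d+ \d+:\d+:\d+\.\d+)[^\n]*?,(\w+),\Q$command\E,(\d+\.\d+\.\d+\.\d+),.*?param\[1\]=\Q$param1\E \[DETAILS:\2\]/s;

my $overlap = 8 * 1024;
my $buffer  = '';
my %printed;                      # the overlap region is scanned twice, so de-duplicate

open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
local $/ = \65536;                # read 64 kB at a time
while (my $block = <$fh>) {
    $buffer .= $block;
    while ($buffer =~ /$searchpattern/g) {
        print "$1 $3 $param1 $2 $command\n" unless $printed{$2}++;
    }
    # keep only the last $overlap bytes for the next iteration
    $buffer = substr($buffer, -$overlap) if length($buffer) > $overlap;
}
close $fh;

Even then it re-scans the overlap on every read and depends on guessing a safe overlap size, which doesn't feel right.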

Does anyone have an idea how this can be solved efficiently (especially in terms of speed and memory)?

I also tried awk and sed, but couldn't get them to work properly because of their limitations with back references (needed to match the same IDENTIFIER) across multiple lines; e.g. something based on this didn't work:

sed -n '/1st line pattern(match-group-1).../,/3rd line pattern\1.../p'

The back reference from the 1st pattern can't be used in the 2nd one. Moreover, sed would also print entries I'm not interested in: once it finds a line matching the beginning pattern, it prints everything until the ending pattern is found, and if the ending pattern is never found (which can happen), it prints everything up to the end of the file. That's also something I want to avoid.

EDIT: added better input sample, clarified description

Notes: here is an awk script along these lines:

#!/bin/sh
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 loginname logfile" >&2
  exit 1
fi

# requires GNU awk (gawk) for the three-argument form of match()
awk -v dbparam="$1" -v cmd="action_login1" -v dbcall="call PKG1.Proc1" '
    $0 ~ ","cmd"," {
        # DATE and TIME from the start of the COMMAND line
        match($0, /^([0-9]+-[0-9]+-[0-9]+) ([0-9]+:[0-9]+:[0-9]+\.[0-9]+)/, matches);
        date = matches[1];
        time = matches[2];
        # IDENTIFIER and ADDRESS from the DETAILS block (EREs have no lazy
        # quantifiers, so use [^,]+ rather than .*?)
        match($0, "DETAILS:[0-9]+,[0-9]+,([^,]+),"cmd",([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+),", matches);
        sessionid = matches[1];
        ipaddress = matches[2];
        seen_command = 1;
        seen_dbcall = 0;
    }
    seen_command && $0 ~ dbcall && $0 ~ "\\[DETAILS:"sessionid {
        seen_dbcall = 1;
    }
    seen_dbcall && $0 ~ "param\\[1\\]="dbparam && $0 ~ "\\[DETAILS:"sessionid {
        print date, time, ipaddress, sessionid, cmd;
        seen_command = 0;
        seen_dbcall = 0;
    }
' "$2"

Upvotes: 1

Views: 316

Answers (2)

ThisSuitIsBlackNot

Reputation: 24063

With Perl, you can do this in one pass through the file using a hash:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Regexp::Common qw(net time);

my $dbcall  = 'call PKG1.Proc1';
my $command = 'action_login1';
my $param1  = 'loginname1';

# Time::Format-compatible pattern
my $date_format = 'yyyy-mm-dd hh:mm{in}:ss.mmm';

my $command_regex = qr/^($RE{time}{tf}{-pat => $date_format}).*\[DETAILS:\d+,\d+,(\w+),$command,($RE{net}{IPv4}),/;
my $dbcall_regex  = qr/execute {$dbcall\(.*\) } \[DETAILS:(\w+)\]/;
my $param1_regex  = qr/param\[1\]=$param1 \[DETAILS:(\w+)\]/;

my %hash;
while (<DATA>) {
    if (/$command_regex/) {
        $hash{$2} = {
            date => $1,
            ip   => $3
        };
    }
    elsif (/$dbcall_regex/) {
        $hash{$1}{seen} = 1;
    }
    elsif (/$param1_regex/) {
        if (exists $hash{$1}{seen}) {
            say join ' ', $hash{$1}{date}, $hash{$1}{ip}, $param1, $1, $command;
            delete $hash{$1};
        }
    }
}

__DATA__
2014-02-25 09:13:57.765 CET [----s-d] [TL]  [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:17:17.086 CET [----s-d] [TL]  [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]

Output:

2014-02-25 09:13:57.765 10.1.1.1 loginname1 W1OOKol6IF2DImfgVgJikUb action_login1
2014-02-25 09:17:17.086 10.1.1.1 loginname1 l3Na0H2bNOTv4AiaelSOS97 action_login1

Because the relative order of the lines within a group is fixed, we can delete the corresponding entry from the hash as soon as we've printed it, keeping memory usage low.
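
To run this against the actual log rather than the inline DATA section, only the input needs to change; a straightforward variation is to replace the while (<DATA>) loop with one that reads the file(s) named on the command line (the three regexes and everything above them stay the same):

my %hash;
# read the log file(s) given on the command line (or STDIN) instead of DATA
while (<>) {
    if (/$command_regex/) {
        $hash{$2} = {
            date => $1,
            ip   => $3
        };
    }
    elsif (/$dbcall_regex/) {
        $hash{$1}{seen} = 1;
    }
    elsif (/$param1_regex/ and exists $hash{$1}{seen}) {
        say join ' ', $hash{$1}{date}, $hash{$1}{ip}, $param1, $1, $command;
        delete $hash{$1};
    }
}

Memory use then depends only on the number of entries still waiting for their param[1] line at any given moment, not on the size of the file.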

Upvotes: 1

glenn jackman

Reputation: 246764

I like awk to implement this kind of state machine. Something like:

awk -v cmd="$command" -v param="$param1" -v dbcall="$dbcall" '
    $0 ~ ","cmd"," {
        datetime    = parse_datetime_from_line()
        identifier  = parse_identifier_from_line()
        address     = parse_address_from_line()
        seen_command = 1
        seen_dbcall = 0
    }
    seen_command && $0 ~ dbcall {
        seen_dbcall = 1
    }
    seen_dbcall && $0 ~ param {
        print datetime, address, identifier, cmd
        seen_command = 0
        seen_dbcall = 0
    }
' file

You don't describe the input file concretely enough, so extracting the important elements from the line is left as an exercise.

Upvotes: 1
