Petr H

Reputation: 460

multi-line search with back reference in large file

Suppose I've got a large text file from which I want to get entries based on the following sample:

DATE TIME some uninteresting text,IDENTIFIER,COMMAND,ADDRESS,some uninteresting text
... some other lines in between ...
DATE TIME some uninteresting text DBCALL some uninteresting text IDENTIFIER some uninteresting text
... some other lines in between ...
DATE TIME some uninteresting text PARAM[1]=PARAM1 some uninteresting text IDENTIFIER some uninteresting text
...

sample with 2 entries:

2014-02-25 09:13:57.765 CET [----s-d] [TL]  [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]

2014-02-25 09:17:17.086 CET [----s-d] [TL]  [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]

Known variables: COMMAND, DBCALL and PARAM1 (supplied up front; in the sample they are action_login1, call PKG1.Proc1 and loginname1).

Variables determined at runtime: DATE, TIME, IDENTIFIER and ADDRESS.

Variable used to locate related lines: IDENTIFIER.

Expected output

For each group of those 3 lines, one output line with DATE TIME ADDRESS PARAM1 IDENTIFIER COMMAND.

Sample of the expected output (for the 2 sample records shown above):

2014-02-25 09:13:57.765 10.1.1.1 loginname1 W1OOKol6IF2DImfgVgJikUb action_login1
2014-02-25 09:17:17.086 10.1.1.1 loginname1 l3Na0H2bNOTv4AiaelSOS97 action_login1

Ordering and complexity

The lines belonging to one entry are not necessarily adjacent, and lines for different IDENTIFIERs can be interleaved (they do not even have to appear in chronological order), as in this sample:

2014-02-25 09:13:57.765 CET [----s-d] [TL]  [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:17:17.086 CET [----s-d] [TL]  [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]

Quantity: the log file is very large, too large to be read into memory as a whole.

Goal

  1. Search for the line with the appropriate DBCALL value (2nd line in the example), get its IDENTIFIER
  2. For that IDENTIFIER locate the nearest line (somewhere below, usually it's just the following line but not always) with the appropriate PARAM[1]=PARAM1 (3rd line in the example). If nothing is found then abort this cycle and continue searching the rest of the file at step 1.
  3. For that IDENTIFIER locate the nearest line (somewhere above) with the appropriate COMMAND (1st line in the example). If nothing is found then abort this cycle and continue searching the rest of the file at step 1.
  4. Print DATE, TIME, ADDRESS, IDENTIFIER, COMMAND for that line (1st line from the example); see the naive sketch right after this list.
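
To make the goal concrete, here is a deliberately naive sketch of those four steps in Perl. It slurps the whole file into memory, which is exactly what I cannot afford to do with the real log, and the search values are just the ones from the sample; it is only meant to pin down the intended behaviour.

#!/usr/bin/perl
use strict;
use warnings;

# Naive illustration of the 4 steps above: it slurps the whole file
# (not viable for the real log) and the search values are the sample ones.
my ($command, $dbcall, $param1) = ('action_login1', 'call PKG1.Proc1', 'loginname1');

open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
my @lines = <$fh>;
close $fh;

for my $i (0 .. $#lines) {
    # step 1: DBCALL line, remember its IDENTIFIER
    next unless $lines[$i] =~ /execute \{\Q$dbcall\E\(.*\) \} \[DETAILS:(\w+)\]/;
    my $id = $1;

    # step 2: is there a param[1]=PARAM1 line further down with the same IDENTIFIER?
    next unless grep { /param\[1\]=\Q$param1\E \[DETAILS:\Q$id\E\]/ } @lines[$i + 1 .. $#lines];

    # step 3: nearest COMMAND line above with the same IDENTIFIER
    for my $j (reverse 0 .. $i - 1) {
        next unless $lines[$j] =~ /^(\S+ \S+) .*\[DETAILS:\d+,\d+,\Q$id\E,\Q$command\E,([\d.]+),/;
        # step 4: DATE, TIME, ADDRESS, PARAM1, IDENTIFIER, COMMAND
        print "$1 $2 $param1 $id $command\n";
        last;
    }
}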

In the worst case this can be simplified to involve just 2 of the 3 lines; for better reliability the order of the steps would then be slightly different:

  1. Search for the line with the appropriate COMMAND (1st line in the example), get its IDENTIFIER.
  2. For that IDENTIFIER locate the nearest line (somewhere below) with the appropriate PARAM[1]=PARAM1 value (2nd line in the example). If nothing is found then abort this cycle and continue searching the rest of the file at step 1.
  3. Print DATE, TIME, ADDRESS, IDENTIFIER, COMMAND for that line (1st line from the example).

I managed to get it (more or less) working for the simpler 2nd scenario in Perl by reading the file in blocks and then doing a multi-line regex search with a back reference, e.g. with something like:

# read 4kB at a time
local $/ = \4096;
...
my $searchpattern = qr/(\d*-\d*-\d*\s\d*:\d*:\d*\.\d*).*?,(\w*?),$command,.*?param\[1\]=$param1.*?\[DETAILS:\2\]/ms;
...

However, the problem is that some matches are missed: the ones that don't fit completely inside a single block that Perl reads and processes at a time. And as the file is very large, I can't read it into memory as a whole. Increasing the block size isn't a solution either, because there will always be cases that span multiple blocks, depending on where the block boundaries happen to fall.
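
One workaround I considered is carrying a tail of each block over into the next read, so that a group of related lines can never be cut by a boundary smaller than the overlap. A rough, untested sketch (the 64 kB block and 8 kB overlap sizes are arbitrary guesses; the overlap would have to be longer than the longest group of related lines, and matches found in the re-scanned tail have to be de-duplicated):

#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch of the overlap idea; sizes are guesses, and the pattern is
# the 2-line variant (COMMAND line, then its param[1] line somewhere below).
my ($command, $param1) = ('action_login1', 'loginname1');
my $searchpattern = qr/(\d+-\d+-\d+ \d+:\d+:\d+\.\d+)[^\n]*?,(\w+),\Q$command\E,(\d+\.\d+\.\d+\.\d+),.*?param\[1\]=\Q$param1\E \[DETAILS:\2\]/s;

my $overlap = 8 * 1024;
my $buffer  = '';
my %printed;                      # the overlap region is scanned twice, so de-duplicate

open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
local $/ = \65536;                # read 64 kB at a time
while (my $block = <$fh>) {
    $buffer .= $block;
    while ($buffer =~ /$searchpattern/g) {
        print "$1 $3 $param1 $2 $command\n" unless $printed{$2}++;
    }
    # keep only the last $overlap bytes for the next iteration
    $buffer = substr($buffer, -$overlap) if length($buffer) > $overlap;
}
close $fh;

Even then it re-scans the overlap on every read and depends on guessing a safe overlap size, which doesn't feel right.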

Does anyone have an idea how this can be solved efficiently (especially in terms of speed and memory)?

I also tried awk and sed, but couldn't get them to work properly because of their limitations with back references (needed to match the same IDENTIFIER) across multiple lines; e.g. something based on this didn't work:

sed -n '/1st line pattern(match-group-1).../,/3rd line pattern\1.../p'

The back reference from the 1st pattern can't be used in the 2nd one. Moreover, sed would also print entries I'm not interested in: once it finds a line matching the beginning pattern, it prints everything until the ending pattern is found, and if the ending pattern is never found (which can happen), it prints everything up to the end of the file. That's also something I want to avoid.

EDIT: added better input sample, clarified description

Notes: here is an awk script along these lines:

#!/bin/sh
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 loginname logfile" >&2
  exit 1
fi

# requires GNU awk (gawk) for the three-argument form of match()
awk -v dbparam="$1" -v cmd="action_login1" -v dbcall="call PKG1.Proc1" '
    $0 ~ ","cmd"," {
        # DATE and TIME from the start of the COMMAND line
        match($0, /^([0-9]+-[0-9]+-[0-9]+) ([0-9]+:[0-9]+:[0-9]+\.[0-9]+)/, matches);
        date = matches[1];
        time = matches[2];
        # IDENTIFIER and ADDRESS from the DETAILS block (EREs have no lazy
        # quantifiers, so use [^,]+ rather than .*?)
        match($0, "DETAILS:[0-9]+,[0-9]+,([^,]+),"cmd",([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+),", matches);
        sessionid = matches[1];
        ipaddress = matches[2];
        seen_command = 1;
        seen_dbcall = 0;
    }
    seen_command && $0 ~ dbcall && $0 ~ "\\[DETAILS:"sessionid {
        seen_dbcall = 1;
    }
    seen_dbcall && $0 ~ "param\\[1\\]="dbparam && $0 ~ "\\[DETAILS:"sessionid {
        print date, time, ipaddress, sessionid, cmd;
        seen_command = 0;
        seen_dbcall = 0;
    }
' "$2"

Upvotes: 1

Views: 316

Answers (2)

ThisSuitIsBlackNot

Reputation: 24063

With Perl, you can do this in one pass through the file using a hash:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Regexp::Common qw(net time);

my $dbcall  = 'call PKG1.Proc1';
my $command = 'action_login1';
my $param1  = 'loginname1';

# Time::Format-compatible pattern
my $date_format = 'yyyy-mm-dd hh:mm{in}:ss.mmm';

my $command_regex = qr/^($RE{time}{tf}{-pat => $date_format}).*\[DETAILS:\d+,\d+,(\w+),$command,($RE{net}{IPv4}),/;
my $dbcall_regex  = qr/execute {$dbcall\(.*\) } \[DETAILS:(\w+)\]/;
my $param1_regex  = qr/param\[1\]=$param1 \[DETAILS:(\w+)\]/;

my %hash;
while (<DATA>) {
    if (/$command_regex/) {
        $hash{$2} = {
            date => $1,
            ip   => $3
        };
    }
    elsif (/$dbcall_regex/) {
        $hash{$1}{seen} = 1;
    }
    elsif (/$param1_regex/) {
        if (exists $hash{$1}{seen}) {
            say join ' ', $hash{$1}{date}, $hash{$1}{ip}, $param1, $1, $command;
            delete $hash{$1};
        }
    }
}

__DATA__
2014-02-25 09:13:57.765 CET [----s-d] [TL]  [DETAILS:22,6,W1OOKol6IF2DImfgVgJikUb,action_login1,10.1.1.1,n/a,n/a,Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0E)]
2014-02-25 09:13:57.819 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:13:57.819 CET [------d] [DB] param[1]=loginname1 [DETAILS:W1OOKol6IF2DImfgVgJikUb]
2014-02-25 09:17:17.086 CET [----s-d] [TL]  [DETAILS:22,13,l3Na0H2bNOTv4AiaelSOS97,action_login1,10.1.1.1,n/a,n/a,Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0]
2014-02-25 09:17:17.087 CET [------d] [DB] execute {call PKG1.Proc1(?,?,?,?,?,?) } [DETAILS:l3Na0H2bNOTv4AiaelSOS97]
2014-02-25 09:17:17.087 CET [------d] [DB] param[1]=loginname1 [DETAILS:l3Na0H2bNOTv4AiaelSOS97]

Output:

2014-02-25 09:13:57.765 10.1.1.1 loginname1 W1OOKol6IF2DImfgVgJikUb action_login1
2014-02-25 09:17:17.086 10.1.1.1 loginname1 l3Na0H2bNOTv4AiaelSOS97 action_login1

Because the relative order of the lines within a group is fixed, we can delete the corresponding entry from the hash as soon as we've printed it, keeping memory usage low.
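
To run this against the actual log rather than the inline DATA section, only the input needs to change; a straightforward variation is to replace the while (<DATA>) loop with one that reads the file(s) named on the command line (the three regexes and everything above them stay the same):

my %hash;
# read the log file(s) given on the command line (or STDIN) instead of DATA
while (<>) {
    if (/$command_regex/) {
        $hash{$2} = {
            date => $1,
            ip   => $3
        };
    }
    elsif (/$dbcall_regex/) {
        $hash{$1}{seen} = 1;
    }
    elsif (/$param1_regex/ and exists $hash{$1}{seen}) {
        say join ' ', $hash{$1}{date}, $hash{$1}{ip}, $param1, $1, $command;
        delete $hash{$1};
    }
}

Memory use then depends only on the number of entries still waiting for their param[1] line at any given moment, not on the size of the file.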

Upvotes: 1

glenn jackman

Reputation: 246764

I like awk to implement this kind of state machine. Something like:

awk -v cmd="$command" -v param="$param1" -v dbcall="$dbcall" '
    $0 ~ ","cmd"," {
        datetime    = parse_datetime_from_line()
        identifier  = parse_identifier_from_line()
        address     = parse_address_from_line()
        seen_command = 1
        seen_dbcall = 0
    }
    seen_command && $0 ~ dbcall {
        seen_dbcall = 1
    }
    seen_dbcall && $0 ~ param {
        print datetime, address, identifier, cmd
        seen_command = 0
        seen_dbcall = 0
    }
' file

You don't describe the input file concretely enough, so extracting the important elements from the line is left as an exercise.

Upvotes: 1
