jokerir

Reputation: 13

regular expression (perl like)

$str1="ssh_2-4^accept IN=ETH2 OUT=eth33 MAC=00:d0:c9:96:62:c0:00:1c:f0:98:19:57:08:00 SRC=192.168.200.30 DST=192.168.200.224 LEN=48 TOS=0x00 PREC=0x00 TTL=128 ID=30546 DF PROTO=TCP SPT=10159 DPT=4319 WINDOW=7300 RES=0x00 SYN URGP=0";

$str2="ssh_2-4^accept IN=ETH2 OUT=eth33 MAC=00:d0:c9:96:62:c0:00:1c:f0:98:19:57:08:00 SRC=192.168.200.30 DST=192.168.200.224 LEN=48 TOS=0x00 PREC=0x00 TTL=128 ID=30546 DF PROTO=ICMP WINDOW=7300 RES=0x00 URGP=0";

I need to capture:

for $str1 ==> ssh_2-4, accept, ETH2, eth33, 192.168.200.30, 192.168.200.224, TCP, 10159, 4319

for $str2 ==> ssh_2-4, accept, ETH2, eth33, 192.168.200.30, 192.168.200.224, ICMP

I use the regexp below. It works very well for $str1, but it doesn't work with $str2:

(\w*)\^(\w*).*IN=(\S*).*OUT=(\S*).*SRC=(\S* ).*DST=(\S*).*PROTO=(\S*).*SPT=(\d*).*DPT=(\d*).*

What is the suitable regexp for this purpose?

Upvotes: 0

Views: 173

Answers (4)

leonbloy

Reputation: 76006

A split would seem more robust and clean to me. For example:

$str2=~  /^(.*?)\^(\w*)\s+(.*)$/;
my($version,$action,$args) = ($1,$2,$3);
my %argsmap =  split(/[= ]/, $args);
print "proto=$argsmap{'PROTO'} \n";

Edited: I wrongly assumed that each "field" had a key=value pair. Fixed version:

  my(@args) = split(/ /,$str2);
  my($version,$action) = split(/\^/,shift @args);
  my %argsmap = map { $_ =~ /(.*)=(.*)/ ? ($1,$2) : ($_,'') } @args;
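As a minimal sketch of how the fixed version might be used to get the fields the question asks for: the helper name parse_log_line and the shortened sample line are assumptions made for illustration, and the pattern splits each token at its first = rather than its last.

```perl
use strict;
use warnings;

# parse_log_line is a hypothetical helper name, not part of the answer above.
sub parse_log_line {
    my ($line) = @_;
    my @args = split / /, $line;
    # The first token has the form "version^action".
    my ($version, $action) = split /\^/, shift @args;
    # KEY=VALUE tokens become pairs; bare flags such as DF map to ''.
    my %argsmap = map { $_ =~ /^([^=]+)=(.*)$/ ? ($1, $2) : ($_, '') } @args;
    return ($version, $action, \%argsmap);
}

# Shortened sample line (several fields omitted for brevity).
my $line = 'ssh_2-4^accept IN=ETH2 OUT=eth33 SRC=192.168.200.30 DST=192.168.200.224 DF PROTO=ICMP URGP=0';
my ($version, $action, $args) = parse_log_line($line);

# A hash slice picks the wanted fields; SPT and DPT are added only if present.
my @wanted = ($version, $action, @{$args}{qw(IN OUT SRC DST PROTO)});
push @wanted, @{$args}{qw(SPT DPT)} if exists $args->{SPT};
my $summary = join ', ', @wanted;
print "$summary\n";
```

Because the ports are looked up by key rather than by position, a line without SPT and DPT simply produces a shorter list.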

Upvotes: 2

TLP

Reputation: 67910

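A more fleshed-out split version, based on leonbloy's answer. Splitting directly on both = and space will not work, because flags without a value (DF, SYN) can leave the list with an odd number of elements and shift the key/value pairing. So instead we split each token explicitly on = and let missing values be undefined, which keeps every key paired with its own value.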

Code:

use strict;
use warnings;

my $str1="ssh_2-4^accept IN=ETH2 OUT=eth33 MAC=00:d0:c9:96:62:c0:00:1c:f0:98:19:57:08:00 SRC=192.168.200.30 DST=192.168.200.224 LEN=48 TOS=0x00 PREC=0x00 TTL=128 ID=30546 DF PROTO=TCP SPT=10159 DPT=4319 WINDOW=7300 RES=0x00 SYN URGP=0";
my $str2="ssh_2-4^accept IN=ETH2 OUT=eth33 MAC=00:d0:c9:96:62:c0:00:1c:f0:98:19:57:08:00 SRC=192.168.200.30 DST=192.168.200.224 LEN=48 TOS=0x00 PREC=0x00 TTL=128 ID=30546 DF PROTO=ICMP WINDOW=7300 RES=0x00 URGP=0";

my @data;
for my $str ($str1, $str2) {
    my %hash;
    # First we extract the "header"
    $str =~ s/^([^^]+)\^(\w+) // || die "Did not match header";
    $hash{'version'} = $1;
    $hash{'action'} = $2;

    # Now process the args
    for my $line (split ' ', $str) {
        my ($key, $val) = split /=/, $line;
        $hash{$key} = $val;
    }
    # Save the hash into an array
    push @data, \%hash;
}

for my $href (@data) {
    # Now output the selected elements from each hash
    my $out = join ", ",
        @$href{'version','action','IN','OUT','SRC','DST','PROTO'};
    if ($href->{'PROTO'} eq 'TCP') {
        $out = join ", ", $out, @$href{'SPT', 'DPT'};
    }
    print "$out\n";
}

Output:

ssh_2-4, accept, ETH2, eth33, 192.168.200.30, 192.168.200.224, TCP, 10159, 4319
ssh_2-4, accept, ETH2, eth33, 192.168.200.30, 192.168.200.224, ICMP
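A small sketch of what happens to the value-less flags with this approach. The token list is a made-up fragment, and the split limit of 2 is an addition that is not in the code above.

```perl
use strict;
use warnings;

my %hash;
for my $token (split ' ', 'DF PROTO=TCP SPT=10159 SYN') {
    # The limit of 2 (an addition here) keeps any later "=" inside the value.
    my ($key, $val) = split /=/, $token, 2;
    $hash{$key} = $val;    # $val is undef for bare flags such as DF and SYN
}

# exists tests for the key itself; defined tests whether it carries a value.
my $has_syn   = exists $hash{SYN}  ? 1 : 0;
my $syn_value = defined $hash{SYN} ? 1 : 0;
my $has_ack   = exists $hash{ACK}  ? 1 : 0;
print "SYN present: $has_syn, SYN has value: $syn_value, ACK present: $has_ack\n";
```

So a flag can still be detected with exists, even though its value is undefined.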

Upvotes: 0

Axeman

Reputation: 29854

Greedy quantifiers mean that each .* first matches all the remaining characters in the line. The engine then fails to find the next part of the expression and has to backtrack until it does. This is highly inefficient.

Instead, you want to use the non-greedy form: .*?. And then, to make sure you match whole words/keys, you could use the word-boundary assertion \b, like so:

my $re 
    = qr/
        ([\w-]*) \^ (\w*) .*? 
        \bIN=(\S*)  .*?
        \bOUT=(\S*) .*?
        \bSRC=(\S*) .*?
        \bDST=(\S*) .*?
        \bPROTO=(\S*)
        (?: .*? 
            \bSPT=(\d*) 
            .*?
            \bDPT=(\d*)
        )?
    /x;

Now, since you don't have SPT and DPT fields in every line, you want to make that part of the match optional: (?:...)?

And that's all I needed to do:

while ( <$data> ) {
    my @flds = m/$re/;
    print join( ',', grep { defined and length } @flds ), "\n"; 
}
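The difference between the greedy form, the non-greedy form, and \b can be seen on a toy string. This is an illustrative sketch only; the string and the variable names are made up for the demonstration.

```perl
use strict;
use warnings;

my $text = 'XIN=1 IN=2 IN=3';

# Greedy .* runs to the end of the string and backtracks, so the LAST IN= wins.
my ($greedy) = $text =~ /^.*IN=(\d)/;

# Non-greedy .*? expands one character at a time, so the FIRST IN= wins,
# which here is the tail of the unrelated key XIN.
my ($lazy) = $text =~ /^.*?IN=(\d)/;

# Adding \b requires IN to start at a word boundary, so XIN= is skipped.
my ($bounded) = $text =~ /^.*?\bIN=(\d)/;

print "greedy=$greedy lazy=$lazy bounded=$bounded\n";
```

This prints greedy=3 lazy=1 bounded=2.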

Upvotes: 0

rjp

Reputation: 1958

$str1="ssh_2-4^accept IN=ETH2 OUT=eth33 MAC=00:d0:c9:96:62:c0:00:1c:f0:98:19:57:08:00 SRC=192.168.200.30 DST=192.168.200.224 LEN=48 TOS=0x00 PREC=0x00 TTL=128 ID=30546 DF PROTO=TCP SPT=10159 DPT=4319 WINDOW=7300 RES=0x00 SYN URGP=0";
$str2="ssh_2-4^accept IN=ETH2 OUT=eth33 MAC=00:d0:c9:96:62:c0:00:1c:f0:98:19:57:08:00 SRC=192.168.200.30 DST=192.168.200.224 LEN=48 TOS=0x00 PREC=0x00 TTL=128 ID=30546 DF PROTO=ICMP WINDOW=7300 RES=0x00 URGP=0";

foreach my $i ($str1, $str2) {
    if ($i =~ /^(.+)\^(\w+)\s+IN=(\S+)\s+OUT=(\S+).*?SRC=(\S+)\s+DST=(\S+).*?PROTO=(\S+)(?:.*?SPT=(\d+)\s+DPT=(\d+))?/) {
        print "/1=$1/2=$2/3=$3/4=$4/5=$5/6=$6/7=$7/8=$8/9=$9\n";
    }
}

This gives

/1=ssh_2-4/2=accept/3=ETH2/4=eth33/5=192.168.200.30/6=192.168.200.224/7=TCP/8=10159/9=4319
/1=ssh_2-4/2=accept/3=ETH2/4=eth33/5=192.168.200.30/6=192.168.200.224/7=ICMP/8=/9=

Capture the SPT and DPT parts inside an optional non-capturing group: (?:.*?SPT=(\d+)\s+DPT=(\d+))?
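When the optional group does not take part in the match, $8 and $9 are undefined, and interpolating them under use warnings produces "uninitialized value" warnings. One possible way around that, sketched here with a shortened sample line and made-up variable names, is to take the captures in list context and keep only the defined ones:

```perl
use strict;
use warnings;

# Same pattern as above, stored with qr// so it can be reused.
my $re = qr/^(.+)\^(\w+)\s+IN=(\S+)\s+OUT=(\S+).*?SRC=(\S+)\s+DST=(\S+).*?PROTO=(\S+)(?:.*?SPT=(\d+)\s+DPT=(\d+))?/;

# Shortened sample line without SPT and DPT (several fields omitted for brevity).
my $line = 'ssh_2-4^accept IN=ETH2 OUT=eth33 SRC=192.168.200.30 DST=192.168.200.224 DF PROTO=ICMP URGP=0';

# In list context a successful match returns one element per capture group,
# with undef for groups that did not participate.
my @fields  = $line =~ $re;
my @present = grep { defined $_ } @fields;
my $report  = join ', ', @present;
print scalar(@fields), " groups, ", scalar(@present), " matched: $report\n";
```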

Upvotes: 0
