bootware

Reputation: 72

regex with variable part

How can I merge these two regexes into a single regex that captures all available parts, depending on the structure of the string? The last 3 fields in $s are optional and should be captured if they exist. I could not get a working solution using (?= ... ).

$s='1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    \Z/x;
print "[".join('],[',$s =~ $re)."]\n\n";   

$s='1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+) [ ] "(.*?)" [ ] "(.*?)" [ ] "(.*?)"
        \Z
        /x;
print "[".join('],[',$s =~ $re)."]\n\n";   

Upvotes: 0

Views: 166

Answers (3)

bootware

Reputation: 72

I did not mean to talk down any of the suggested solutions.

As I said before: nice idea, but it only does what the module name promises, namely ParseWords.

"Find me a test case where your regex works and my solution fails if you want to continue this discussion ...".

Of course I have tested your solution for my purposes.

In your solution the fields shift around depending on the input.

With the regex, the fields are always found at defined positions

(for example, Authuser at $token[5] and Year at $token[9]).

Here is the test:

#!/usr/bin/perl -w
use strict;
use warnings;
use Text::ParseWords;

my $re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    (?: [ ] (\S*))? (?: [ ] (\S*))?
    [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

my (@s,@token);
#---- most entries ------------------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283');
#---- referer, user agent, ... ------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"');
#---- auth without password ---------------------------------------------------
push(@s,'1.2.3.4 - ausr  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488');
#---- no http request --------------------------------------------------------- 
push(@s,'1.2.3.4 - - [13/Jun/2007:19:16:18 +0200] "-" 408 -');
#---- auth with password ------------------------------------------------------
push(@s,'1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- auth without user -------------------------------------------------------
push(@s,'1.2.3.4 -  pwd [16/Aug/2007:08:43:50 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- multiple words in request -----------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is test HTTP/1.0" 404 283'); 

no warnings 'uninitialized';
foreach(@s)
{ @token=$_ =~ $re;
  print "regex:      AUTHUSER=".$token[5].", YEAR=".$token[9]."\n";
  @token=quotewords('[\s/:\[\].]+', 0, $_);
  print "quotewords: AUTHUSER=".$token[5].", YEAR=".$token[9]."\n\n";
}

And here are the results:

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=ausr, YEAR=2007
quotewords: AUTHUSER=ausr, YEAR=21

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=19

regex:      AUTHUSER=ausr, YEAR=2006
quotewords: AUTHUSER=ausr, YEAR=2006

regex:      AUTHUSER=, YEAR=2007
quotewords: AUTHUSER=pwd, YEAR=08

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01
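
The "defined positions" point can also be expressed with named captures, so that fields are looked up by name instead of by index. The following is only a sketch under some assumptions: it needs Perl 5.10 or later, the pattern is cut down to the part of the line up to the timestamp, and the group names (host, ident, user, pwd, day, month, year) are made up for illustration.

```perl
use strict;
use warnings;

# Shortened pattern: only the fields up to the timestamp are named.
# The group names are illustrative, not part of the test script above.
my $re = qr/\A
    (?<host>\S+)
    [ ] (?<ident>\S+)
    (?: [ ] (?<user>\S*))? (?: [ ] (?<pwd>\S*))?
    [ ] \[ (?<day>\d+) \/ (?<month>\S+) \/ (?<year>\d+) : [^\]]* \]
    /x;

my @lines = (
    '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283',
    '1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489',
    '1.2.3.4 -  pwd [16/Aug/2007:08:43:50 +0200] "GET /x.htm HTTP/1.1" 401 489',
);

my @seen;
for my $line (@lines) {
    next unless $line =~ $re;
    my %f = %+;    # copy the named captures of the last successful match
    push @seen, "$f{user}/$f{year}";
    print "AUTHUSER=$f{user}, YEAR=$f{year}\n";
}
```

With these three sample lines, the user and year are found under the same names regardless of whether a password field is present; for the line without a user, the user field is an empty string.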

Upvotes: 0

TLP

Reputation: 67920

When your regexes start looking like that, I think it's a good idea to start thinking about alternatives. In this case, you might try Text::ParseWords, since your strings are sort of delimited and contain quoted strings. It is a core module in Perl 5.

Basically, what we're doing is passing quotewords a regex for the delimiters that we expect, a 0 or 1 for whether to keep the quotes, and the input lines themselves.

use strict;
use warnings;
use Text::ParseWords;

my $s = '1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
my @s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

$s = '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
@s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

Output:

[1],[2],[3],[4],[-],[egon],[10],[Dec],[2007],[21],[07],[20],[+0100],[GET /x.htm HTTP/1.1],[401],[488]

[1],[2],[3],[4],[-],[-],[13],[Jun],[2007],[01],[37],[44],[+0200],[GET /x.htm HTTP/1.0],[404],[283],[-],[Mozilla/5.0...],[-]

Upvotes: 4

bonsaiviking

Reputation: 6005

Instead of using a lookahead (?=), you can use a non-capturing group (?:) and match zero or one occurrence:

$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

This will yield a fixed-length array of captures, but the last 3 will be undef if the optional group does not match. If you have to match between 1 and 3 optional fields, wrap each in its own non-capturing group with zero or one (?) occurrences. I also tried this, but it doesn't work:

(?: [ ] "(.*?)" ){0,3} \Z

It matches, and captures each of the last three fields, but each capture overwrites the final position in the capture array, so after the capture is done, it contains just the final field.
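
The difference between the two approaches can be seen in a small sketch. It uses a cut-down pattern (status, size, and the quoted tail only) and a made-up sample string, so the variable names and the sample values are assumptions for illustration, not part of the original log format.

```perl
use strict;
use warnings;

# Hypothetical, shortened tail of a log line: status, size, three quoted fields.
my $tail = '404 283 "http://example.com/" "Mozilla/5.0..." "-"';

# Quantified capture group: one pair of parentheses is one capture slot,
# so each repetition overwrites the value stored by the previous one.
my @repeated = $tail =~ /\A (\S+) [ ] (\S+) (?: [ ] "(.*?)" ){0,3} \z/x;
print scalar(@repeated), " captures, last one: [$repeated[2]]\n";

# One optional group per field: every field keeps its own slot,
# and the slots of missing fields are undef.
my $each = qr/\A (\S+) [ ] (\S+)
    (?: [ ] "(.*?)" )?
    (?: [ ] "(.*?)" )?
    (?: [ ] "(.*?)" )?
    \z/x;

my @all  = $tail =~ $each;
my @some = '404 283 "-"' =~ $each;
print join(',', map { defined $_ ? "[$_]" : 'undef' } @all),  "\n";
print join(',', map { defined $_ ? "[$_]" : 'undef' } @some), "\n";
```

The quantified version returns three captures with only the final quoted field in the third slot, while the per-field version always returns five slots, padding with undef when fewer quoted fields are present.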

I would caution you that you are using a very strict expression that may not be suited to all web logs: specifically, the match for IP address will not handle IPv6 addresses, and the match for User-agent may not handle user agents with " characters, depending on how they are escaped (lighttpd 1.4.28 does not escape them, for instance).
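
As a rough sketch of one way to loosen the address match, the four octet captures could be replaced by a single non-space token. This is an assumption about what is acceptable for your data: it no longer validates the address at all, and every later capture index shifts down by three compared with the pattern above. The IPv6 sample line is made up for illustration.

```perl
use strict;
use warnings;

my $re = qr/\A
    (\S+)                      # host as one token, IPv4 or IPv6 or a name
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

my @hosts;
for my $line (
    '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283',
    '2001:db8::1 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283',
) {
    my @f = $line =~ $re or next;
    push @hosts, $f[0];
    # With the host in a single slot, the status code is capture index 13.
    print "host=$f[0] status=$f[13]\n";
}
```

Both sample lines match with this variant, at the cost of accepting any non-space string as the host.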

Upvotes: 2
