user1508213

Reputation: 95

Perl Regex Capture

I have the following text:

GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/base-css  
DNT: 1  
Connection: keep-alive  
HTTP/1.1 200 OK  
Cache-Control: max-age=900  
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images  /buttonlarge_yellownormal.png HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0   
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/css  
DNT: 1  

and the following Perl regex:

while ($1 =~m/((GET|PUT|POST|CONNECT)\s+\S+)(?:(?!GET|PUT|POST|CONNECT\s+\S+).)*?Host:\s([^\n]+).*?User-Agent:\s([^\n]+).*?Referer:\s([^\n]+).*?Connection:/msg) {
    # do something
}

It matches this fine:

GET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png  
www.microsoft.com  
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
http://www.microsoft.com/mac/css  

However, I also need it to examine the following text:

GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1  
Host: i.ytimg.com  
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1  
Accept-Language: en-us, *;q=0.5  
Gdata-Version: 2  
X-Gdata-Client: ytapi-apple-ipad  
Accept: */*  
Accept-Encoding: gzip, deflate  
Connection: keep-alive  
Q2J}  

and match the following:

GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1  
i.ytimg.com  
Apple iPad v4.3.5 YouTube v1.0.0.8L1  

while still being able to match the previous text presented successfully.

Upvotes: 2

Views: 510

Answers (2)

ikegami

Reputation: 385546

HTTP request and response headers are not as trivial to parse as you might expect. For example, the following are all equivalent:

Accept-Encoding: gzip, deflate

Accept-Encoding: gzip,
    deflate

Accept-Encoding: gzip
Accept-Encoding: deflate
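
To make the equivalence concrete, here is a minimal sketch of the normalisation a parser has to perform: unfold continuation lines, then merge repeated fields into one comma-separated value. The helper name `normalize_headers` is hypothetical, and the sketch deliberately ignores other details (for instance, header names are case-insensitive), so treat it as an illustration rather than a complete parser.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise a block of header lines into a hash,
# unfolding continuation lines and merging repeated fields.
sub normalize_headers {
    my ($text) = @_;

    # Unfold: a line break followed by leading whitespace continues
    # the previous header line.
    $text =~ s/\r?\n[ \t]+/ /g;

    my %headers;
    for my $line ( split /\r?\n/, $text ) {
        next unless $line =~ m/\A([^\s:]+):\s*(.*?)\s*\z/;
        my ( $name, $value ) = ( $1, $2 );

        # A repeated field is equivalent to one comma-separated field.
        $headers{$name} = exists $headers{$name}
            ? "$headers{$name}, $value"
            : $value;
    }
    return %headers;
}

my @variants = (
    "Accept-Encoding: gzip, deflate\n",
    "Accept-Encoding: gzip,\n    deflate\n",
    "Accept-Encoding: gzip\nAccept-Encoding: deflate\n",
);

# Each of the three spellings normalises to the same value.
for my $variant (@variants) {
    my %h = normalize_headers($variant);
    print "$h{'Accept-Encoding'}\n";
}
```

All three variants print `gzip, deflate`, which is exactly the kind of bookkeeping a single regex over the raw text does not do for you.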

As such, I recommend you use an existing parser:

use strict;
use warnings;
use feature qw( say );
use HTTP::Request qw( );

my $s = <<'__EOI__';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/base-css  
DNT: 1  
Connection: keep-alive  
HTTP/1.1 200 OK  
Cache-Control: max-age=900  
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images  /buttonlarge_yellownormal.png HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0   
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/css  
DNT: 1  
__EOI__

my ($raw_req, $raw_resp) = split qr{(?=^HTTP/)}m, $s;
my $req = HTTP::Request->parse($raw_req);
say $req->method;
say $req->uri;
say $req->user_agent;
say $req->header('User-Agent');  # Same as previous

Upvotes: 2

Sarah Roberts

Reputation: 860

So, if I understand your question correctly, you need the Referer header to be optional. You can do that by wrapping the portion of your regex that matches that header in non-capturing parentheses and placing a question mark after the closing parenthesis:

(?:Referer:\s([^\n]+))?

If any other headers are optional, you can do the same thing with them.
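
A minimal sketch of one way to wire this in, assuming the text holds a single request. The helper name `extract_fields` is hypothetical. Note that the lazy `.*?` is moved inside the optional group: if it stays in front of the group, it matches nothing, the group is then skipped at that position, and the Referer is never captured even when it is present.

```perl
use strict;
use warnings;

# Sample request taken from the question (one request only).
my $request = <<'END_OF_REQUEST';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept: */*
Connection: keep-alive
END_OF_REQUEST

# Hypothetical helper: returns (request, host, user agent, referer).
# The referer is undef when the header is absent; the list is empty
# when the text does not match at all.
sub extract_fields {
    my ($text) = @_;
    return $text =~ m{
        ( (?:GET|PUT|POST|CONNECT) \s+ \S+ )   # method and path
        .*? Host:       \s ([^\n]+)
        .*? User-Agent: \s ([^\n]+)
        (?: .*? Referer: \s ([^\n]+) )?        # optional, with its own .*?
        .*? Connection:
    }xms;
}

my ( $req, $host, $ua, $referer ) = extract_fields($request);
print "$req\n$host\n$ua\n";
print defined $referer ? "$referer\n" : "(no Referer)\n";
```

One caveat with this sketch: when several requests share one string, the optional group can reach forward into a later request to find a Referer, so the text should be split into individual requests first.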

EDIT: The data stops being captured after the first missing header.

This isn't perfect yet, because it doesn't work if there are multiple HTTP requests in a single data file, but it should get you going in the right direction:

use warnings;
use strict;

my $str = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
END_OF_STR

my @lines = split m/[\n]/xms, $str;

# Build the regex to match the HTTP methods we care about.
my @methods = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta $_ } @methods;

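# Skip to the first request line and print it.
# The alternation is grouped so that \A applies to every method.
while ( @lines && $lines[0] !~ m/ \A (?:$methods_re) \b /xms ) {
    shift @lines;
}
die "No request line found\n" if !@lines;
print "$lines[0]\n";
shift @lines;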

# Build the regex to match the headers we care about.
my @headers = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta $_ } @headers;

# Find the headers that we matched.
for my $line (@lines) {
    if ( $line =~ m/ \A (?:$headers_re):\s*(.*) /xms ) {
        print "$1\n";
    }
}

exit;

I'll add another update shortly that will account for multiple HTTP requests in a single file.

EDIT: This solution correctly prints the values you're looking for, but it only prints them. If you want to get the values for each specific request then something more complex will be required.

use warnings;
use strict;

my $str = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
END_OF_STR

my @lines = split m/[\n]/xms, $str;

# Build the regexes to match the HTTP methods and headers we care about.
my @methods = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta $_ } @methods;
my @headers = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta $_ } @headers;

for my $line (@lines) {
    if ( $line =~ m/ \A (?:$methods_re) \b /xms ) {
        print "$line\n";
    }
    elsif ( $line =~ m/ \A (?:$headers_re):\s*(.*) /xms ) {
        print "$1\n";
    }
}

exit;
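
If you do need the values per request rather than just printed, a minimal sketch along the same lines is to start a new hash whenever a request line is seen and attach each wanted header to the most recent one. The helper name `parse_requests` and the `request` hash key are hypothetical. The sketch assumes every request line starts at the beginning of a line, and it uses the `//` operator, which needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of hashes, one per request.
# Assumes each request line starts at the beginning of a line.
sub parse_requests {
    my ($text) = @_;

    my $methods_re = join '|', map { quotemeta } qw(GET PUT POST CONNECT);
    my $headers_re = join '|', map { quotemeta } qw(Host User-Agent Referer);

    my @requests;
    for my $line ( split /\r?\n/, $text ) {
        $line =~ s/\s+\z//;    # drop trailing whitespace
        if ( $line =~ m/\A((?:$methods_re)\s+\S+)/ ) {
            # A request line starts a new record.
            push @requests, { request => $1 };
        }
        elsif ( @requests && $line =~ m/\A($headers_re):\s*(.*)/ ) {
            # Attach the header to the most recent request.
            $requests[-1]{$1} = $2;
        }
    }
    return @requests;
}

# Two requests from the question, trimmed to the relevant headers.
my $text = <<'END_OF_TEXT';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Connection: keep-alive
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
Referer: http://www.microsoft.com/mac/base-css
END_OF_TEXT

for my $req ( parse_requests($text) ) {
    print "$req->{request}\n";
    for my $name (qw(Host User-Agent Referer)) {
        print "  $name: ", $req->{$name} // '(none)', "\n";
    }
}
```

Because a missing header simply leaves its key unset, optional headers such as Referer need no special handling here.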

Upvotes: 2
