Perl Regex Capture

Question

I have the following text:

GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/base-css  
DNT: 1  
Connection: keep-alive  
HTTP/1.1 200 OK  
Cache-Control: max-age=900  
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0   
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/css  
DNT: 1

and the following Perl regex:

while ($1 =~ m/((GET|PUT|POST|CONNECT)\s+\S+)(?:(?!GET|PUT|POST|CONNECT\s+\S+).)*?Host:\s([^\n]+).*?User-Agent:\s([^\n]+).*?Referer:\s([^\n]+).*?Connection:/msg) {
    # do something
}

and it matches this fine:

GET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png  
www.microsoft.com  
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
http://www.microsoft.com/mac/css

However, I also need it to examine the following text:

GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1  
Host: i.ytimg.com  
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1  
Accept-Language: en-us, *;q=0.5  
Gdata-Version: 2  
X-Gdata-Client: ytapi-apple-ipad  
Accept: */*  
Accept-Encoding: gzip, deflate  
Connection: keep-alive  
Q2J}

and match the following:

GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1  
i.ytimg.com  
Apple iPad v4.3.5 YouTube v1.0.0.8L1

while still being able to match the previous text presented successfully.

ikegami · Accepted Answer

HTTP request and response headers are not as trivial to parse as you might expect. For example, the following are all equivalent:

Accept-Encoding: gzip, deflate

Accept-Encoding: gzip,
    deflate

Accept-Encoding: gzip
Accept-Encoding: deflate
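
To make the equivalence concrete, here is a minimal sketch in core Perl that reduces all three forms to the same value. The helper name `normalize_headers` is hypothetical, and the sketch assumes that repeated fields may simply be joined with a comma; it is an illustration of the problem, not a complete header parser.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a raw header block into a hash of
# lowercased field names and combined values.
sub normalize_headers {
    my ($raw) = @_;

    # Unfold continuation lines: a line break followed by spaces or
    # tabs continues the previous field value.
    $raw =~ s/\r?\n[ \t]+/ /g;

    my %headers;
    for my $line (split /\r?\n/, $raw) {
        my ($name, $value) = $line =~ /^([^\s:]+):\s*(.*?)\s*$/ or next;
        $name = lc $name;    # field names are case-insensitive

        # Assumption: a repeated field is equivalent to one field
        # with comma-separated values.
        $headers{$name} = exists $headers{$name}
            ? "$headers{$name}, $value"
            : $value;
    }
    return \%headers;
}

my @forms = (
    "Accept-Encoding: gzip, deflate\n",
    "Accept-Encoding: gzip,\n    deflate\n",
    "Accept-Encoding: gzip\nAccept-Encoding: deflate\n",
);

for my $form (@forms) {
    my $headers = normalize_headers($form);
    print "$headers->{'accept-encoding'}\n";
}
```

A single regex over the raw text has to account for every one of these shapes, and real traffic adds more cases than these three.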

As such, I recommend you use an existing parser:

use strict;
use warnings;
use feature qw( say );
use HTTP::Request qw( );

my $s = <<'__EOI__';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/base-css  
DNT: 1  
Connection: keep-alive  
HTTP/1.1 200 OK  
Cache-Control: max-age=900  
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0   
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/css  
DNT: 1  
__EOI__

# Split the capture just before the response status line;
# the first chunk is the raw request.
my ($raw_req, $raw_resp) = split qr{(?=^HTTP/)}m, $s;
my $req = HTTP::Request->parse($raw_req);
say $req->method;
say $req->uri;
say $req->user_agent;
say $req->header('User-Agent');  # Same as previous
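
The same approach should also cover the second text from the question, which has no Referer header. The following is a sketch under a few assumptions: the request head has already been separated from the trailing binary data, the trailing spaces have been removed from each line, and the variable names are illustrative. When a field is absent, `header` returns undef in scalar context, so the missing Referer needs no special pattern.

```perl
use strict;
use warnings;
use feature qw( say );
use HTTP::Request qw( );

# Assumption: only the request head is passed in, without the
# binary residue that follows it in the captured text.
my $raw_req = <<'__EOI__';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive

__EOI__

my $req = HTTP::Request->parse($raw_req);

# Request line pieces, then the two headers the question asks for.
say join ' ', $req->method, $req->uri, $req->protocol;
say $req->header('Host');
say $req->header('User-Agent');

# Referer is absent here, so header() returns undef.
my $referer = $req->header('Referer');
say defined $referer ? $referer : '(no Referer header)';
```

Because each field is looked up by name, the order of the headers and the presence or absence of Referer no longer matter, which is the part the original regex could not handle.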
