Reputation: 95
I have the following text:
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.microsoft.com/mac/base-css
DNT: 1
Connection: keep-alive
HTTP/1.1 200 OK
Cache-Control: max-age=900
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.microsoft.com/mac/css
DNT: 1
and the following Perl regex:
while ($1 =~m/((GET|PUT|POST|CONNECT)\s+\S+)(?:(?!GET|PUT|POST|CONNECT\s+\S+).)*?Host:\s([^\n]+).*?User-Agent:\s([^\n]+).*?Referer:\s([^\n]+).*?Connection:/msg) {
# do something
}
It matches this fine, capturing:
GET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png
www.microsoft.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
http://www.microsoft.com/mac/css
However, I also need it to examine the following text:
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Q2J}
and match the following:
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
i.ytimg.com
Apple iPad v4.3.5 YouTube v1.0.0.8L1
while still being able to match the previous text presented successfully.
Upvotes: 2
Views: 510
Reputation: 385546
HTTP request and response headers are not as trivial to parse as one might expect. For example, the following three forms are all equivalent:
Accept-Encoding: gzip, deflate

Accept-Encoding: gzip,
    deflate

Accept-Encoding: gzip
Accept-Encoding: deflate
As such, I recommend you use an existing parser such as HTTP::Request:
use strict;
use warnings;
use feature qw( say );
use HTTP::Request qw( );
my $s = <<'__EOI__';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.microsoft.com/mac/base-css
DNT: 1
Connection: keep-alive
HTTP/1.1 200 OK
Cache-Control: max-age=900
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.microsoft.com/mac/css
DNT: 1
__EOI__
# Split the capture at the start of the response's status line.
my ($raw_req, $raw_resp) = split qr{(?=^HTTP/)}m, $s;
my $req = HTTP::Request->parse($raw_req);
say $req->method;
say $req->uri;
say $req->user_agent;
say $req->header('User-Agent'); # Same as previous
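To illustrate why the equivalent forms listed above make hand-rolled parsing awkward, here is a minimal core-Perl sketch that reduces all three to the same value. The `normalize_headers` helper is hypothetical (it is not part of HTTP::Request) and only handles line folding and repeated fields; a real parser deals with more cases than this.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::Request): collapse a raw header
# block into a hash of lowercased field name => combined value.
sub normalize_headers {
    my ($raw) = @_;

    # Unfold continuation lines: a line break followed by spaces or tabs
    # continues the previous header's value.
    $raw =~ s/\r?\n[ \t]+/ /g;

    my %headers;
    for my $line (split /\r?\n/, $raw) {
        my ($name, $value) = $line =~ /^([^\s:]+)[ \t]*:[ \t]*(.*?)[ \t]*$/
            or next;
        $name = lc $name;    # field names are case-insensitive

        # Repeated fields are combined into one comma-separated value.
        $headers{$name} = exists $headers{$name}
            ? "$headers{$name}, $value"
            : $value;
    }
    return \%headers;
}

my @variants = (
    "Accept-Encoding: gzip, deflate\n",
    "Accept-Encoding: gzip,\n  deflate\n",
    "Accept-Encoding: gzip\nAccept-Encoding: deflate\n",
);

# Each variant prints the same combined value.
print normalize_headers($_)->{'accept-encoding'}, "\n" for @variants;
```

A single regex that expects each header on exactly one line would treat these three inputs differently, which is the argument for using an existing parser.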
Upvotes: 2
Reputation: 860
So, if I understand your question correctly, you need the Referer header to be optional. You can do that by wrapping the portion of your regex that matches that header in non-capturing parentheses and placing a question mark after the closing parenthesis:
(?:Referer:\s([^\n]+))?
If any other headers are optional, you can do the same thing with them.
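One thing to watch for when applying this: if the optional group sits directly after a lazy `.*?`, the lazy quantifier first matches nothing, the group is tried only at that position, and it ends up matching empty even when a Referer line exists further on. A minimal sketch of one way around that, moving the lazy skip inside the optional group, is below. The `extract_fields` name is hypothetical, and the code assumes each string holds a single request.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumes $text holds a single request; with several
# requests in one string, the optional group could reach into a later one.
sub extract_fields {
    my ($text) = @_;
    my ($request, $host, $agent, $referer) = $text =~ m/
        ( (?:GET|PUT|POST|CONNECT) \s+ \S+ )   # method and path
        .*? Host: \s ([^\n]+)                  # Host value
        .*? User-Agent: \s ([^\n]+)            # User-Agent value
        (?: .*? Referer: \s ([^\n]+) )?        # Referer value, if present
    /xs or return;
    return {
        request => $request,
        host    => $host,
        agent   => $agent,
        referer => $referer,    # undef when the header is absent
    };
}

my $with_referer = <<'END';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Referer: http://www.microsoft.com/mac/base-css
Connection: keep-alive
END

my $without_referer = <<'END';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept: */*
Connection: keep-alive
END

for my $text ($with_referer, $without_referer) {
    my $f = extract_fields($text) or next;
    print join("\n", @{$f}{qw(request host agent)}, $f->{referer} // '(no Referer)'), "\n";
}
```

Because the `?` on the group is greedy, the engine first tries to find a Referer line and only falls back to the empty match when none exists.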
EDIT: The data stops being captured after the first missing header.
This isn't perfect yet, because it doesn't work if there are multiple HTTP requests in a single data file, but it should get you going in the right direction:
use warnings;
use strict;
my $str = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
END_OF_STR
my @lines = split m/[\n]/xms, $str;
# Build the regex to match the HTTP methods we care about.
my @methods = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta $_ } @methods;
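# Skip to the first request line and print it.
# The alternation is grouped so that \A applies to every method, not just the first.
while ( @lines && $lines[0] !~ m/ \A (?:$methods_re) \s /xms ) {
    shift @lines;
}
die "No request line found\n" if !@lines;
print "$lines[0]\n";
shift @lines;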
# Build the regex to match the headers we care about.
my @headers = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta $_ } @headers;
# Find the headers that we matched.
for my $line (@lines) {
    if ( $line =~ m/ \A (?:$headers_re):\s*(.*) /xms ) {
        print "$1\n";
    }
}
exit;
I'll add another update shortly that will account for multiple HTTP requests in a single file.
EDIT: This solution correctly prints the values you're looking for, but it only prints them. If you want to get the values for each specific request then something more complex will be required.
use warnings;
use strict;
my $str = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
END_OF_STR
my @lines = split m/[\n]/xms, $str;
# Build the regexes to match the HTTP methods and headers we care about.
my @methods = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta $_ } @methods;
my @headers = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta $_ } @headers;
for my $line (@lines) {
    # Group the alternation so that \A applies to every method.
    if ( $line =~ m/ \A (?:$methods_re) \s /xms ) {
        print "$line\n";
    }
    elsif ( $line =~ m/ \A (?:$headers_re):\s*(.*) /xms ) {
        print "$1\n";
    }
}
exit;
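If you do need the values grouped per request, one possible shape for the "something more complex" is sketched below: start a new hash whenever a request line appears and attach the following headers to it. The `collect_requests` name and the hash layout are assumptions for illustration. Like the loops above, it is line-based, so it will miss a request line that is fused onto the end of another line, and it does not distinguish request headers from response headers.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hash per request found in $text.
sub collect_requests {
    my ($text) = @_;

    my $methods_re = join '|', map { quotemeta } qw(GET PUT POST CONNECT);
    my $headers_re = join '|', map { quotemeta } qw(Host User-Agent Referer Connection);

    my @requests;
    for my $line (split /\r?\n/, $text) {
        if ( $line =~ m/ \A ( (?:$methods_re) \s+ \S+ ) /xms ) {
            # A request line starts a new record.
            push @requests, { request => $1 };
        }
        elsif ( @requests && $line =~ m/ \A ($headers_re) : \s* (.*) /xms ) {
            # Attach the header to the most recent request.
            $requests[-1]{$1} = $2;
        }
    }
    return @requests;
}

my $capture = <<'END';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Referer: http://www.microsoft.com/mac/base-css
Connection: keep-alive
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Connection: keep-alive
END

for my $r (collect_requests($capture)) {
    print "$r->{request}\n";
    print "  $_: $r->{$_}\n" for grep { exists $r->{$_} } qw(Host User-Agent Referer);
}
```

A missing header simply has no key in that request's hash, so optional headers such as Referer need no special handling.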
Upvotes: 2