Reputation: 7032
I want my script to download only text/html content, not binaries or images that could take significantly more time to download. I know about the max_size parameter, but I would like to add a check on the Content-Type header. Is this doable?
Upvotes: 4
Views: 612
Reputation: 4005
As others have pointed out, you can perform a HEAD request before your GET request. You ought to do this as a courtesy to the server: it is easy for you to abort the connection, but not necessarily easy for the web server to stop sending a bunch of data and doing a bunch of work on its end.
There are some different ways to do this depending on how sophisticated you want to be.
You can send an Accept header with your request that lists only text/html. A well-implemented HTTP server will return a 406 Not Acceptable status if the resource is not available in a type you said you accept. Of course, the server might send it to you anyway. You can send the same header with your HEAD request as well.
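As a rough sketch of the Accept approach, the snippet below builds a HEAD request that lists only text/html and then decides from the reply whether a GET is worthwhile. The helper name `wants_body` is made up for illustration, and the responses are constructed by hand with HTTP::Response so the logic can be tried without a network connection; in real use the response would come from `$ua->request($request)`.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Build a HEAD request that advertises text/html as the only acceptable type.
my $request = HTTP::Request->new(HEAD => 'http://example.com/foo');
$request->header(Accept => 'text/html');

# Hypothetical helper: decide from a HEAD response whether a GET is worthwhile.
sub wants_body {
    my ($response) = @_;
    return 0 if $response->code == 406;     # server refused our Accept list
    return 0 unless $response->is_success;  # any other failure
    # Servers may ignore Accept, so still check what they claim to send.
    return $response->content_type eq 'text/html' ? 1 : 0;
}

# Hand-made responses standing in for what a server might return.
my $html = HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/html; charset=utf-8']);
my $png  = HTTP::Response->new(200, 'OK', ['Content-Type' => 'image/png']);
my $no   = HTTP::Response->new(406, 'Not Acceptable');

print wants_body($_) ? "fetch\n" : "skip\n" for $html, $png, $no;
```

The content type is checked even after a successful reply because a server is free to ignore the Accept header.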
When using a recent version of LWP::UserAgent, you can use a handler subroutine to abort the rest of the request after the headers and before the content body.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # Abort as soon as the headers arrive if the body is not HTML.
    $ua->add_handler( response_header => sub {
        my ($response, $ua, $h) = @_;
        die "Not HTML\n" unless $response->content_type eq 'text/html';
    });

    my $url = "http://example.com/foo";
    my $html;

    # Ask with HEAD first, then fetch the body with GET only if that succeeds.
    my $head_response = $ua->head($url, Accept => "text/html");
    if ($head_response->is_success) {
        my $get_response = $ua->get($url, Accept => "text/html");
        if ($get_response->is_success) {
            $html = $get_response->content;
        }
    }
See the Handlers section of the LWP::UserAgent documentation for details on handlers.
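I haven't checked for the aborted request or dealt with 406 responses carefully here. As far as I know, LWP traps the die from the handler and reports its message in the X-Died header of the response rather than throwing it to the caller. I leave the handling as an exercise for the reader.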
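One way to approach that exercise, as a sketch only: the helper below looks at the X-Died header, which is where LWP is documented to record a die raised in a callback, and treats a 406 separately from other failures. The name `html_or_reason` is hypothetical, and the responses are simulated by hand so the code runs without a network connection; real ones would come from `$ua->get(...)`.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: return (content, undef) for usable HTML,
# or (undef, reason) explaining why the response was rejected.
sub html_or_reason {
    my ($response) = @_;
    my $died = $response->header('X-Died');
    return (undef, "aborted: $died")       if defined $died;
    return (undef, 'server sent 406')      if $response->code == 406;
    return (undef, $response->status_line) unless $response->is_success;
    return ($response->content, undef);
}

# Simulated responses standing in for real ones.
my $ok = HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/html'], '<p>hi</p>');
my $aborted = HTTP::Response->new(200, 'OK', ['Content-Type' => 'image/png']);
$aborted->push_header('X-Died' => 'Not HTML');   # what an aborted fetch would carry
my $refused = HTTP::Response->new(406, 'Not Acceptable');

for my $response ($ok, $aborted, $refused) {
    my ($html, $reason) = html_or_reason($response);
    print defined $html ? "got HTML\n" : "skipped ($reason)\n";
}
```

The X-Died check comes first because an aborted response can still carry a 200 status.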
Upvotes: 6
Reputation: 126742
If you are using the minimal LWP::Simple interface to LWP, then the head function, called in list context, returns the content type as the first element of a list.
So you can write
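    use strict;
    use warnings;
    use LWP::Simple;

    for my $url ('http://www.bbc.co.uk') {
        # head returns an empty list on failure, so $ctype may be undefined.
        my ($ctype) = head $url;
        # The header may carry parameters such as "; charset=utf-8".
        next unless defined $ctype and $ctype =~ m{^text/html\b}i;
        my $content = get $url;
    }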
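Because a raw Content-Type value often carries parameters such as a charset, an exact string comparison can miss genuine HTML pages. Here is a minimal sketch of a more forgiving check, using only core Perl; the helper name `is_html_type` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a raw Content-Type value names text/html,
# ignoring case and any parameters after the media type.
sub is_html_type {
    my ($ctype) = @_;
    return 0 unless defined $ctype and length $ctype;
    my ($media_type) = split /;/, $ctype, 2;   # drop "; charset=..." and the like
    $media_type =~ s/^\s+|\s+$//g;             # trim surrounding whitespace
    return lc($media_type) eq 'text/html' ? 1 : 0;
}

for my $value ('text/html', 'Text/HTML; charset=UTF-8', 'image/png', undef) {
    my $shown = defined $value ? $value : '(no header)';
    printf "%-28s %s\n", $shown, is_html_type($value) ? 'html' : 'skip';
}
```

The value returned by `head` could be passed straight to this helper before deciding whether to call `get`.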
Upvotes: 0
Reputation: 3484
You can use a HEAD request to query the URI's header info. If the server responds to HEAD requests, you'll get everything that a GET would have returned, except for that pesky body.
You can then decide what to do based on the MIME type.
Otherwise, you'll have to rely on the file's extension before you request it.
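For the extension fallback, a rough sketch follows. It assumes the URI module (which LWP itself depends on) is installed; the helper name `probably_binary` and the list of extensions are purely illustrative. An extension is only a hint, since a server can send any content type for any path.

```perl
use strict;
use warnings;
use URI;

# Extensions assumed here to be non-HTML; illustrative, not exhaustive.
my %binary_ext = map { $_ => 1 } qw(jpg jpeg png gif pdf zip gz mp3 mp4 exe);

# Hypothetical helper: guess from the URL path alone whether to skip a fetch.
sub probably_binary {
    my ($url) = @_;
    my $path = URI->new($url)->path;            # excludes query string and fragment
    my ($ext) = $path =~ /\.([A-Za-z0-9]+)$/;   # last extension, if any
    return 0 unless defined $ext;
    return $binary_ext{ lc $ext } ? 1 : 0;
}

print "$_ => ", (probably_binary($_) ? 'skip' : 'try'), "\n"
    for 'http://example.com/logo.PNG?v=2',
        'http://example.com/index.html',
        'http://example.com/';
```

URLs without a recognisable extension are treated as worth trying, so a Content-Type check on the response is still needed.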
Upvotes: 1