Reputation: 1
I am trying to run the code below to fetch and parse the contents of the HTML page at the URL shown:
#!/usr/bin/perl
use LWP::Simple;
use HTML::TreeBuilder;
$response = get("http://www.viki.com/");
print $response;
Nothing gets printed, even though the same URL loads fine in a browser.
Upvotes: 0
Views: 549
Reputation: 8345
When I try to access http://www.viki.com using LWP::UserAgent, I get the following response:
<html><body><h1>403 Forbidden</h1>
Request forbidden by administrative rules.
</body></html>
The get subroutine in LWP::Simple is implemented as follows (at least in version 6.13):
sub get ($)
{
my $response = $ua->get(shift);
return $response->decoded_content if $response->is_success;
return undef;
}
As you can see, get only returns the content if the response is a success; otherwise it returns undef.
The response from LWP::UserAgent is a 403 error, in other words not a success. Therefore, LWP::Simple returns undef for the same URL, and print outputs nothing.
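To see why a request failed instead of silently getting undef, one option is to inspect the response object yourself. The following is only a minimal sketch: content_or_die is a hypothetical helper name (not part of LWP), and the 403 response is built by hand so the example runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper (not part of LWP): return the decoded body on
# success, otherwise die with the HTTP status line.
sub content_or_die {
    my ($response) = @_;
    return $response->decoded_content if $response->is_success;
    die "Request failed: " . $response->status_line . "\n";
}

# Hand-built 403 response, standing in for what the server returned.
my $forbidden = HTTP::Response->new(403, 'Forbidden');
my $content   = eval { content_or_die($forbidden) };
print "Error: $@" unless defined $content;
```

In a real script you would pass in the result of a live request, for example content_or_die(LWP::UserAgent->new->get($url)), so that the status line is reported instead of nothing being printed.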
It appears that the website (http://www.viki.com) is checking the user agent string and only returning content to "valid" user agents. LWP::Simple is hard-coded to use LWP::Simple/$VERSION as the user agent.
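If you want to check which user agent string your installed LWP::Simple sends by default, a small sketch like this should do it. It relies only on the exportable $ua variable and the agent accessor of LWP::UserAgent; the exact string depends on the installed version, so none is shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw/ $ua /;

# Print the default User-Agent header value that LWP::Simple will send.
# It should begin with "LWP::Simple/" followed by the module version.
my $default_agent = $ua->agent;
print "Default user agent: $default_agent\n";
```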
If you really must use LWP::Simple, then you could force the user agent like this:
use LWP::Simple qw/ get $ua /;
$ua->agent('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0');
print get('http://www.viki.com');
LWP::Simple exposes the LWP::UserAgent instance that it uses internally as the optionally exported $ua variable. You still have to set the user agent string on this instance to get this particular page to load.
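If you are not tied to LWP::Simple, the same idea can be sketched with LWP::UserAgent directly, which also lets you report the status line on failure. Treat this as an illustration under assumptions: build_agent is a hypothetical helper name, the 10 second timeout is arbitrary, and whether the site accepts this particular user agent string may have changed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that identifies itself like a browser.
sub build_agent {
    return LWP::UserAgent->new(
        agent   => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
        timeout => 10,    # arbitrary value chosen for this sketch
    );
}

# Only perform a request when a URL is given on the command line.
if (@ARGV) {
    my $response = build_agent()->get($ARGV[0]);
    if ($response->is_success) {
        print $response->decoded_content;
    }
    else {
        warn "Request failed: " . $response->status_line . "\n";
    }
}
```

Saved under a name of your choosing, it would be run with the URL as its only argument, for example http://www.viki.com/.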
Upvotes: 3