Michael_Branco

Reputation: 17

Perl LWP::Simple won't GET some URLs

I am trying to write a basic web-scraping program in Perl. For some reason it is not working correctly, and I don't have the slightest clue why.

The first part of my code, which just fetches the page content (saving all of the HTML from the web page to a variable), does not work with certain websites.

I am testing it by printing the variable, and with this specific website nothing is printed. It works with some other sites, but not all.

Is there another way of doing this that will work?

use strict;
use warnings;
use LWP::Simple qw/get getstore/;


## Grab a Web page, and throw the content in a Perl variable.
my $content = get("https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650");
print $content;

Upvotes: 0

Views: 671

Answers (2)

Borodin

Reputation: 126722

You have a badly-written web site there. The request times out with a 500 Internal Server Error.

I can't suggest how to get around it, and the site almost certainly uses JavaScript as well, which LWP doesn't support, so I doubt that an answer would be much use to you.


Update

It looks like the site has been written so that it fails with the 500 error if there is no Accept-Language header in the request.

The full LWP::UserAgent module is necessary to set that header, like this:

use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);
my $url = 'https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650';

# Extra key/value arguments to get() are sent as request headers.
my $resp = $ua->get($url, 'Accept-Language' => 'en-gb,en');
print $resp->status_line, "\n\n";
print $resp->decoded_content;

This returns with a status of 200 OK and some HTML.
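
If you need to fetch more than one page from the site, it may be more convenient to set the header once on the user agent instead of passing it to every get call. The following is only a sketch of that idea: make_agent and fetch_page are names I made up for illustration, and it assumes the missing Accept-Language header is the only thing the site objects to.

```perl
use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper: build a user agent that sends the given
# Accept-Language value with every request it makes.
sub make_agent {
    my ($languages) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);
    $ua->default_header('Accept-Language' => $languages);
    return $ua;
}

# Hypothetical helper: fetch a URL and return the decoded body,
# or die with the status line so a failure is not silent
# (LWP::Simple's get just returns undef in that case).
sub fetch_page {
    my ($ua, $url) = @_;
    my $resp = $ua->get($url);
    die "GET $url failed: ", $resp->status_line, "\n"
        unless $resp->is_success;
    return $resp->decoded_content;
}

my $ua = make_agent('en-gb,en');

# Usage (left commented out because it needs network access):
# my $html = fetch_page($ua, 'https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650');
# print $html;
```

Dying with the status line is a design choice for a small script; in a longer scraper you would probably return the response object and let the caller decide what to do with a failure.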

Upvotes: 4

Miller

Reputation: 35198

To interact with a website that uses JavaScript, I would advise that you use the WWW::Mechanize::Firefox module, which drives a real Firefox browser (it requires the MozRepl add-on):

use strict;
use warnings;

use WWW::Mechanize::Firefox;

my $url = "https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650";

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);

print $mech->status();

my $content = $mech->content();

Upvotes: 0
