Caractacus

Reputation: 83

How can I scrape images off of a webpage using Perl?

I would like to scrape all images off of a webpage and am running into a problem I don't understand.

For instance, if I enter https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images into my browser and then use the browser's "View Source" option, I get a massive amount of text/code. Using "find" I get more than 400 instances of

https://

The simple code I wrote (below) gets the content and writes the result to a file, but a grep search for https:// in that file only returns 7 instances. So obviously I am doing something incorrectly. Perhaps the page is dynamic and I can't access that part?

Is there a way I can get the same source, via Perl, that I get via View Source?

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/8.0");    # replaces the default LWP agent string

my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";

my $req = HTTP::Request->new(GET => $sstring);
$req->header('Accept' => 'text/html');

my $res = $ua->request($req);

# Write the fetched document to a file so it can be searched with grep
open(my $fh, '>', 'report.txt') or die "Cannot open report.txt: $!";
print $fh $res->decoded_content;
close $fh;

Here's the example I got from WWW::Mechanize::Chrome:

my $mech = WWW::Mechanize::Chrome->new();
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=\"" . $arrayp . "\"" . "+movie+poster&safe=images";        
$mech->get($sstring);

sleep 5;

print $_->get_attribute('href'), "\n\t-> ", $_->get_attribute('innerHTML'), "\n"
          for $mech->selector('a.download');

Upvotes: 2

Views: 141

Answers (1)

Holli

Reputation: 5082

Google search uses JavaScript to alter the page content after load. LWP::UserAgent does not support JavaScript, so what you get is only the initial document. (Hint: an easy way to see in the browser what LWP::UserAgent "sees" is to use a browser add-on to disable JavaScript.)

You will need to use what is called a "headless browser", for example WWW::Mechanize::Chrome.
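
As a rough illustration of the headless-browser approach, the sketch below loads the search URL from the question in Chrome, waits for scripts to run, saves the rendered page, and prints the src attribute of every img element. It is only a sketch under assumptions: it presumes a local Chrome or Chromium install that WWW::Mechanize::Chrome can find, the 5-second wait is an arbitrary guess rather than a reliable "page finished" signal, and the 'img' selector is an assumption about where the image URLs end up in Google's markup (many thumbnails may come back as data: URIs rather than https:// links).

```perl
use strict;
use warnings;
use Log::Log4perl qw(:easy);
use WWW::Mechanize::Chrome;

# WWW::Mechanize::Chrome logs through Log4perl; only show errors
Log::Log4perl->easy_init($ERROR);

# Same search URL as in the question
my $url = 'https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images';

# Start Chrome without a visible window (assumes Chrome is installed locally)
my $mech = WWW::Mechanize::Chrome->new( headless => 1 );
$mech->get($url);

# Crude wait so the page's JavaScript can modify the DOM (arbitrary duration)
$mech->sleep(5);

# Save the rendered document, i.e. the DOM after scripts have run
open( my $fh, '>:encoding(UTF-8)', 'report.txt' )
    or die "Cannot open report.txt: $!";
print {$fh} $mech->content;
close $fh;

# Print the source of every <img> element found in the rendered page
for my $img ( $mech->selector('img') ) {
    my $src = $img->get_attribute('src');
    print "$src\n" if defined $src && length $src;
}
```

Running grep for https:// on the report.txt written this way should give a count much closer to what the browser shows than the LWP::UserAgent version does, since the file now holds the page after JavaScript has run.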

Upvotes: 4
