Reputation: 83
I would like to scrape all images off of a webpage and am running into a problem I don't understand.
For instance, if I enter https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images into my browser and then use the browser's "View Source" option, I get a massive amount of text/code. Using "find" I get more than 400 instances of
https://
So the simple code I wrote (below) fetches the content and writes the result to a file. But a grep search for https:// in that file only returns 7 instances. So obviously I am doing something incorrectly; perhaps the page is dynamic and I can't access that part?
Is there a way I can get the same source, via Perl, that I get via "View Source"?
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/8.0");    # note: a second agent() call would overwrite the first, so set it once
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
my $req = HTTP::Request->new(GET => $sstring);
$req->header('Accept' => 'text/html');
my $res = $ua->request($req);
die "Request failed: ", $res->status_line unless $res->is_success;
open(my $fh, '>', 'report.txt') or die "Cannot open report.txt: $!";
print $fh $res->decoded_content;
close $fh;
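As a sanity check, the grep count can be reproduced inside Perl itself with a match in list context. A small illustration (the $html value below is just a stand-in for the fetched page content such as $res->decoded_content):

```perl
use strict;
use warnings;

# Stand-in for the fetched HTML; in the real script this would be
# $res->decoded_content.
my $html = "https://a.example https://b.example http://plain";

# A global match in list context returns one element per match;
# assigning that list to () and then to a scalar yields the count.
my $count = () = $html =~ m{https://}g;
print "$count\n";   # prints 2
```

If this count on the saved file matches the grep count but not the browser's "View Source" count, the difference is in what was downloaded, not in how it was searched.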
Here's the example I got from WWW::Mechanize::Chrome:
use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new();
# $arrayp holds the search term and comes from surrounding code
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=\"" . $arrayp . "\"" . "+movie+poster&safe=images";
$mech->get($sstring);
sleep 5;
print $_->get_attribute('href'), "\n\t-> ", $_->get_attribute('innerHTML'), "\n"
    for $mech->selector('a.download');
Upvotes: 2
Views: 141
Reputation: 5082
The Google search uses JavaScript to alter the page content after load. LWP::UserAgent does not support JavaScript, so what you get is only the initial document. (Hint: an easy way to see in the browser what LWP::UserAgent "sees" is to use a browser addon that disables JavaScript.)
You will need to use what is called a "headless browser", for example WWW::Mechanize::Chrome.
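To make that concrete, here is a minimal sketch of fetching the rendered source with WWW::Mechanize::Chrome, mirroring the LWP script in the question. It requires a local Chrome/Chromium install; the headless option and the fixed sleep are simplifying assumptions, and a real script should wait for specific elements instead of sleeping:

```perl
use strict;
use warnings;
use WWW::Mechanize::Chrome;

# Launch a headless Chrome instance (Chrome/Chromium must be installed).
my $mech = WWW::Mechanize::Chrome->new( headless => 1 );

my $url = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
$mech->get($url);
sleep 5;    # crude wait for the page's JavaScript to finish populating content

# content() returns the current DOM serialized as HTML, i.e. the page
# *after* JavaScript has run, not just the initial document.
open(my $fh, '>', 'report.txt') or die "Cannot open report.txt: $!";
print $fh $mech->content;
close $fh;
```

Grepping report.txt produced this way should come much closer to the count seen in the browser, since the JavaScript-inserted image URLs are now part of the saved document.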
Upvotes: 4