Reputation: 87
I am trying to create a PHP function that downloads the images from a webpage passed in as a parameter. However, the webpage itself is a kind of gallery that only contains very small thumbnail versions of the images, each linking directly to the larger full-size JPEG that I actually want to download to my local computer. So the images will not be downloaded from the webpage I pass to the function, but rather from the individual links to those JPEG files on the page.
So for example:
www.somesite.com/galleryfullofimages/
is the location of the image gallery,
and each JPEG file from the gallery that I want is then located at something like:
www.somesite.com/galleryfullofimages/images/01.jpg
www.somesite.com/galleryfullofimages/images/02.jpg
www.somesite.com/galleryfullofimages/images/03.jpg
What I've been trying to do so far is use the file_get_contents
function to get the full HTML of the webpage as a string, then isolate the URLs inside the quotes of all the <a href="images/01.jpg">
elements and put them in an array. I would then use this array to locate each image and download them all with a loop.
This is what I have done so far:
<?php
$link = "http://www.somesite.com/galleryfullofimages/";
$contents = file_get_contents($link);
$results = preg_split('/<a href="[^"]*"/', $contents);
?>
But I am stuck at this point. I am also totally new to regular expressions, which, as you can see, I tried to use. How can I isolate each image link and then download the image? Or is there a better way of doing this altogether? I have also read about using cURL, but I can't seem to implement that either.
I hope this all makes sense. Any help will be greatly appreciated.
Upvotes: 1
Views: 1510
Reputation: 17487
This is commonly known as "scraping" a website. You are already retrieving the markup for the page, so you are off to a good start.
Here's what you need to do next:
<?php
// Load the retrieved markup into a DOM object using PHP's
// DOMDocument::loadHTML method.
$docObj = new DOMDocument();
$docObj->loadHTML($contents);
// Create an XPath object.
$xpathObj = new DOMXPath($docObj);
// Query for all anchor tags whose href attribute ends in ".jpg". You can
// get very creative here, depending on your understanding of XPath. Note
// that PHP's DOMXPath supports XPath 1.0 only, which has no ends-with()
// function, so we emulate it with substring() and string-length().
$elements = $xpathObj->query('//a[substring(@href, string-length(@href) - 3) = ".jpg"]');
// Process the discovered image URLs. You could use cURL for this,
// or file_get_contents again (since your host has allow_url_fopen enabled)
// to fetch each image directly and then store it locally; a sketch of
// that step follows after this block.
foreach ($elements as $domNode)
{
$url = $domNode->getAttribute('href');
}
?>
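To complete the loop, here is a minimal sketch of the download step using file_get_contents. It assumes $link still holds the gallery URL, $elements is the node list returned by the XPath query above, the hrefs are relative to the gallery URL (as in your example), and a writable local directory named downloads/ exists; adjust the path handling if your gallery uses absolute URLs.
<?php
// Minimal sketch of the download step: $link is the gallery URL,
// $elements is the DOMNodeList from the XPath query above, and
// "downloads/" is a writable local directory (an assumption).
foreach ($elements as $domNode)
{
    $href = $domNode->getAttribute('href');

    // Resolve the (possibly relative) href against the gallery URL.
    // This simple check only distinguishes absolute http(s) URLs from
    // relative paths like "images/01.jpg".
    if (preg_match('#^https?://#i', $href)) {
        $imageUrl = $href;
    } else {
        $imageUrl = rtrim($link, '/') . '/' . ltrim($href, '/');
    }

    // Fetch the image (allow_url_fopen must be enabled) and store it
    // locally under its original file name.
    $imageData = file_get_contents($imageUrl);
    if ($imageData !== false) {
        file_put_contents('downloads/' . basename($imageUrl), $imageData);
    }
}
?>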
Relevant documentation:
DOMDocument::loadHTML
XPath
DOMXPath::query
allow_url_fopen
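Since you mentioned cURL: here is a hedged equivalent of the fetch step using the cURL extension instead of file_get_contents, which is useful if allow_url_fopen is disabled on your host. $imageUrl is assumed to be the resolved image URL from the loop sketched above.
<?php
// cURL alternative for fetching a single image; $imageUrl is the resolved
// image URL and "downloads/" a writable directory, as in the sketch above.
$ch = curl_init($imageUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
$imageData = curl_exec($ch);
curl_close($ch);

if ($imageData !== false) {
    file_put_contents('downloads/' . basename($imageUrl), $imageData);
}
?>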
Upvotes: 4