Reputation: 1656

Download millions of images from external website

I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos, that's ~10M photos, and we're required to download each of them to our server so that we don't "hot link" to them.

I'm at a complete loss as to how to do this efficiently. I played with some numbers and concluded that, at 0.5 seconds per image, downloading ~10M images one at a time from an external server would take about 58 days (10,000,000 × 0.5 s = 5,000,000 s), which is obviously unacceptable.

Each photo seems to be roughly 50KB, but that varies: some are much larger and some are smaller.

I've been testing by simply using:

copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');

I've also tried cURL, wget, and others.

I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.

Pseudocode based on the XML feed we're set to receive. We're parsing the XML using PHP:

<listing>
    <listing_id>12345</listing_id>
    <listing_photos>
        <photo>http://example.com/photo1.jpg</photo>
        <photo>http://example.com/photo2.jpg</photo>
        <photo>http://example.com/photo3.jpg</photo>
        <photo>http://example.com/photo4.jpg</photo>
        <photo>http://example.com/photo5.jpg</photo>
        <photo>http://example.com/photo6.jpg</photo>
        <photo>http://example.com/photo7.jpg</photo>
        <photo>http://example.com/photo8.jpg</photo>
        <photo>http://example.com/photo9.jpg</photo>
        <photo>http://example.com/photo10.jpg</photo>
    </listing_photos>
</listing>

So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).

Any thoughts?

Upvotes: 3

Answers (3)

Tek

Reputation: 3050

Before you do this

As @BrokenBinar said in the comments, take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep() to keep your request rate under whatever limit they can handle.
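As a rough sketch of that kind of throttling (the helper name throttleDelayMicros and the limit of 5 requests per second are made up for illustration; use whatever rate the host actually agrees to):

```php
<?php
// Hypothetical helper: how many microseconds to wait before starting the
// next request so that no more than $maxPerSecond requests start per second.
function throttleDelayMicros(float $lastStart, float $now, float $maxPerSecond): int
{
    $minInterval = 1.0 / $maxPerSecond;            // seconds between request starts
    $remaining = $minInterval - ($now - $lastStart);
    return $remaining > 0 ? (int) round($remaining * 1000000) : 0;
}

// Usage sketch. $urls would come from the parsed feed; it is empty here.
$urls = array();
$lastStart = 0.0;
foreach ($urls as $url) {
    usleep(throttleDelayMicros($lastStart, microtime(true), 5.0));
    $lastStart = microtime(true);
    // download $url here
}
```

With parallel downloads the same idea applies per batch rather than per request.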

Curl Multi

Anyway, use cURL. This is somewhat of a duplicate answer, but here is the code copied anyway:

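// $url1, $url2 and $url3 are placeholders for the URLs to fetch
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$results = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++) {
    $curl_arr[$i] = curl_init($nodes[$i]);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// Run all transfers; curl_multi_select() waits for activity so the loop
// does not spin at full CPU while the downloads are in flight
do {
    curl_multi_exec($master, $running);
    if ($running > 0) {
        curl_multi_select($master, 1.0);
    }
} while ($running > 0);

for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
    curl_multi_remove_handle($master, $curl_arr[$i]);
}
curl_multi_close($master);
print_r($results);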

From: PHP Parallel curl requests
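For millions of photos you probably don't want every response body held in memory. Here is a sketch of the same curl_multi idea that streams each image straight to disk in batches. The function names, the md5-based file naming, the .jpg extension, the 30 second timeout and the batch size of 20 are all assumptions of mine, not something from the linked answer:

```php
<?php
// Hypothetical naming scheme: one file per URL, named after the URL's md5.
function localPathFor(string $url, string $dir): string
{
    return rtrim($dir, '/') . '/' . md5($url) . '.jpg';
}

// Download one batch of URLs in parallel, writing each body to its own file.
// Returns an array of URL => HTTP status code. Assumes $dir is writable.
function downloadBatch(array $urls, string $dir): array
{
    $master = curl_multi_init();
    $handles = array();
    $files = array();
    foreach ($urls as $i => $url) {
        $files[$i] = fopen(localPathFor($url, $dir), 'wb');
        $handles[$i] = curl_init($url);
        curl_setopt($handles[$i], CURLOPT_FILE, $files[$i]);   // stream to disk
        curl_setopt($handles[$i], CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($handles[$i], CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($master, $handles[$i]);
    }

    do {
        curl_multi_exec($master, $running);
        if ($running > 0) {
            curl_multi_select($master, 1.0);
        }
    } while ($running > 0);

    $status = array();
    foreach ($urls as $i => $url) {
        $status[$url] = curl_getinfo($handles[$i], CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($master, $handles[$i]);
        fclose($files[$i]);
    }
    curl_multi_close($master);
    return $status;
}

// Usage sketch: $allUrls would be the photo URLs from the feed (empty here).
$allUrls = array();
foreach (array_chunk($allUrls, 20) as $batch) {
    $status = downloadBatch($batch, '/path/to/folder');
    // retry or log any URL whose status is not 200
}
```

The batch size is effectively your concurrency, so it is the number to agree on with the host.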

Another solution:

Pthread

<?php

class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run(){
        $this->response_body = file_get_contents(
            $this->request_url);
    }
}

class WebWorker extends Worker {
    public function run(){}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
for ($thread = 1; $thread <= $max; $thread++) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack(
        $job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);
$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed["response_body"]);
}
printf("Total of %d bytes\n", $length);
?>

Source: PHP testing between pthreads and curl

You should really use the search feature, ya know :)

Upvotes: 2

kheld

Reputation: 792

I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.

I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.

The way the proxy works is that we encrypt the image URL and then URL-encode the encrypted string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup the img src attribute is something like http://ourproxy.com/getImage.ashx?image=xtX957z.

When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
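Our proxy is not written in PHP, but the disk-cache step described above could look roughly like this in PHP. This is only a sketch: the function names are invented, the decryption step is left out, and $fetch stands in for whatever actually downloads the image from the vendor:

```php
<?php
// Derive the cache file name from the URL, e.g.
// http://yourvendor.com/img1.jpg -> yourvendorcom.img1.jpg
function cacheFileName(string $url): string
{
    $host = str_replace('.', '', (string) parse_url($url, PHP_URL_HOST));
    $path = trim((string) parse_url($url, PHP_URL_PATH), '/');
    // Flatten sub-directories so the result is always a single file name
    return $host . '.' . str_replace('/', '.', $path);
}

// Serve from disk when possible; otherwise fetch once, store, and serve.
function fetchThroughCache(string $url, string $cacheDir, callable $fetch): string
{
    $file = rtrim($cacheDir, '/') . '/' . cacheFileName($url);
    if (is_file($file)) {
        return file_get_contents($file);   // cache hit: no vendor request
    }
    $body = $fetch($url);                  // cache miss: ask the vendor
    file_put_contents($file, $body);
    return $body;
}
```

In real use $fetch could be something as simple as 'file_get_contents' or a small cURL wrapper, and the proxy script would echo the returned bytes with the right Content-Type header.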

Upvotes: 2

Jakub Filipczyk

Reputation: 1141

You can save all the links into a database table (it will be your "job queue"). Then you can create a script which, in a loop, gets a job and does it (fetches the image for a single link and marks the job record as done). You can run several copies of the script, e.g. using supervisord, so the job queue is processed in parallel. If it's too slow you can just start another worker script (as long as bandwidth is not the bottleneck).

If a script hangs for some reason you can easily run it again and it will fetch only the images that haven't been downloaded yet. By the way, supervisord can be configured to automatically restart each script if it fails.

Another advantage is that at any time you can check the output of those scripts with supervisorctl. To check how many images are still waiting you can simply query the "job queue" table.
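A minimal sketch of such a worker loop, under my own assumptions: the table name photo_jobs, its columns, and the function names are invented, and the SQL is written for SQLite (MySQL rejects an UPDATE that selects from the same table in a subquery, so the claim query would need a different form there). $download is whatever function actually fetches one image:

```php
<?php
// Assumed schema:
//   CREATE TABLE photo_jobs (id INTEGER PRIMARY KEY, url TEXT,
//                            status TEXT DEFAULT 'pending', worker TEXT);

// Claim one pending job for this worker; returns the row, or null if none is left.
function claimJob(PDO $db, string $workerId): ?array
{
    $claim = $db->prepare(
        "UPDATE photo_jobs SET status = 'working', worker = :w
         WHERE id = (SELECT id FROM photo_jobs
                     WHERE status = 'pending' ORDER BY id LIMIT 1)"
    );
    $claim->execute(array(':w' => $workerId));
    if ($claim->rowCount() === 0) {
        return null;
    }
    $find = $db->prepare(
        "SELECT id, url FROM photo_jobs
         WHERE status = 'working' AND worker = :w ORDER BY id LIMIT 1"
    );
    $find->execute(array(':w' => $workerId));
    $row = $find->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

function markDone(PDO $db, int $id): void
{
    $done = $db->prepare("UPDATE photo_jobs SET status = 'done' WHERE id = :id");
    $done->execute(array(':id' => $id));
}

// Process jobs until the queue is empty; returns how many were completed.
function runWorker(PDO $db, string $workerId, callable $download): int
{
    $count = 0;
    while (($job = claimJob($db, $workerId)) !== null) {
        $download($job['url']);
        markDone($db, (int) $job['id']);
        $count++;
    }
    return $count;
}
```

Each worker needs its own unique $workerId. A job whose worker dies mid-download stays in 'working', so you would reset those rows to 'pending' before restarting. Counting the rows where status is 'pending' tells you how many images are still waiting.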

Upvotes: 2
