Reputation: 1255
The files I would like to download are kept on an external server for a week, and a new XML file (10-50 MB) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB at a time, and then have the download resume the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay updated, so perhaps a bit more of the file each time if possible. I have researched SimpleXML, XMLReader and SAX parsing, but whatever I do, it seems to take too long to parse the file directly, so I would like a different approach, namely downloading it as described above.
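A minimal sketch of that chunk-by-chunk idea, assuming the external server honours HTTP Range requests (not confirmed); the URL, local file path and 0.5 MB chunk size below are placeholders, not values from my actual setup:
<?php
// Rough sketch of the chunk-by-chunk approach, assuming the remote server
// honours HTTP Range requests. URL, local path and chunk size are placeholders.
$remote_url = 'http://www.external-site-example.com/2012/01/01/12.xml';
$local_file = __DIR__ . '/partial_download.xml';
$chunk_size = 512 * 1024; // roughly 0.5 MB per page load

// Resume from wherever the previous page load left off.
$offset = file_exists($local_file) ? filesize($local_file) : 0;

$ch = curl_init($remote_url);
curl_setopt($ch, CURLOPT_RANGE, $offset . '-' . ($offset + $chunk_size - 1));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$chunk  = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// 206 means the server returned the requested byte range; a plain 200 would
// mean it ignored the Range header and sent the whole file.
if ($chunk !== false && $status === 206) {
    file_put_contents($local_file, $chunk, FILE_APPEND);
}
?>
You would still need some way of knowing when the file is complete, for example comparing the local size against the remote Content-Length header.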
If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same directly from the external server, limiting it to 50k iterations, it takes 15 seconds to read that small part, so parsing it straight from that server does not seem feasible.
I think it's best to use cURL, but then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on which functions to use to make this happen, or for different solutions on how to get a 50 MB external XML file parsed into a MySQL database.
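For the parsing side, here is a rough sketch of streaming an already-downloaded file into MySQL with XMLReader and PDO; the element name "record", the table "records" and its "payload" column are made up for illustration, since I haven't shown the feed's real structure:
<?php
// Sketch: stream-parse a locally downloaded XML file with XMLReader and insert
// each matching element into MySQL via PDO. Element, table and column names
// are invented placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO records (payload) VALUES (:payload)');

$reader = new XMLReader();
$reader->open('/path/to/downloaded.xml');

$pdo->beginTransaction(); // a single transaction keeps hundreds of thousands of inserts fast
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // readOuterXml() returns the whole <record> element as a string
        $stmt->execute(array(':payload' => $reader->readOuterXml()));
    }
}
$pdo->commit();
$reader->close();
?>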
I suspect a cron job every hour would be the best solution, but I am not sure how well that is supported by web hosting companies, and I have no clue how to set one up. If that's the best solution and the majority thinks so, I will have to do my research in that area too.
If a Java applet or JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.
Thanks in advance for all answers, and sorry for the long story/question.
Edit: I ended up using this solution to get the files, with a cron job scheduling a PHP script. It checks my folder for which files I already have, generates a list of the possible downloads for the last four days, and then downloads the next XML file in line.
<?php
// Build the timestamp range: from four days ago up to now, stepping one hour.
$date          = new DateTime();
$current_time  = $date->getTimestamp();
$four_days_ago = $current_time - 345600; // 4 * 24 * 3600 seconds

echo 'Downloading: ' . "\n";
for ($i = $four_days_ago; $i <= $current_time; $i += 3600) {
    $date->setTimestamp($i);
    if ($date->format('H') !== '00') {
        $temp_filename = $date->format('Y_m_d_H') . "_full.xml";
        if (!glob($temp_filename)) { // not downloaded yet
            $temp_url = 'http://www.external-site-example.com/' . $date->format('Y/m/d/H') . ".xml";
            echo $temp_filename . ' --- ' . $temp_url . '<br>' . "\n";
            break; // with a break here, this loop only returns the next file to download
        }
    }
}

if (!isset($temp_url)) {
    echo 'Nothing new to download.' . "\n";
    exit;
}

set_time_limit(300);
$Start = getTime();

// Stream the remote file straight to disk instead of holding it in memory.
$objInputStream = fopen($temp_url, "rb");
$objTempStream  = fopen($temp_filename, "w+b");
stream_copy_to_stream($objInputStream, $objTempStream, (1024 * 200000));
fclose($objInputStream);
fclose($objTempStream);

$End = getTime();
echo '<br>It took ' . number_format(($End - $Start), 2) . ' secs to download "' . $temp_filename . '".';

function getTime() {
    $a = explode(' ', microtime());
    return (double) $a[0] + $a[1];
}
?>
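Assuming the script above is saved as, say, /var/www/fetch_xml.php and the CLI PHP binary lives at /usr/bin/php (both paths are guesses and will differ per host), the hourly crontab entry could look like this:
# Run at five minutes past every hour; paths are assumptions, adjust for your host.
5 * * * * /usr/bin/php /var/www/fetch_xml.php >> /var/log/fetch_xml.log 2>&1
Many shared hosts expose the same thing through a "Cron jobs" section in their control panel instead of a raw crontab.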
Edit 2: I just wanted to mention that there is a way to do what I asked, only it wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. With smaller amounts of data there are some options, though: http://www.google.no/search?q=poormanscron
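For reference, that "poor man's cron" idea boils down to something like the sketch below: on each page load, check a timestamp file and only run the hourly job once an hour has passed. The stamp-file path and run_hourly_job() are placeholders:
<?php
// Poor man's cron sketch: piggyback on page loads instead of a real cron job.
// The stamp-file path and run_hourly_job() are placeholders for illustration.
function run_hourly_job() {
    // e.g. download/process the next XML file here
}

$stamp_file = __DIR__ . '/last_run.txt';
$last_run   = file_exists($stamp_file) ? (int) file_get_contents($stamp_file) : 0;

if (time() - $last_run >= 3600) {
    // Write the new timestamp first so overlapping page loads don't all trigger it.
    file_put_contents($stamp_file, (string) time());
    run_hourly_job();
}
?>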
Upvotes: 1
Views: 3339
Reputation: 31641
You need a scheduled, offline task (e.g., a cron job). The solution you are pursuing is just plain wrong.
The simplest thing that could possibly work is a PHP script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.
Upvotes: 1
Reputation: 463
You could try fopen:
<?php
// Open the remote XML file as a read-only binary stream...
$handle = fopen("http://www.example.com/test.xml", "rb");
// ...and read the whole response into a string.
$contents = stream_get_contents($handle);
fclose($handle);
?>
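Note that stream_get_contents() reads the whole response into memory, which for a 10-50 MB XML file can get close to PHP's memory_limit on constrained hosts; writing the stream straight to a local file (for example with stream_copy_to_stream(), as the script in the question's edit does) avoids holding the entire file in memory.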
Upvotes: 0