Reputation:
Is this possible to do? I'm currently coding in PHP with the cURL library, but the question applies to HTTP as a whole.
The most obvious approach seemed to be sending a HEAD request to the data URL and reading its Content-Length header, but the problem is that some servers, including Apache 2.0, do not send Content-Length in response to HEAD requests. And since the header is not mandatory, there is no guarantee that every server out there will supply it, even for a GET request.
I'm making the server download web pages specified by user input and store them on the server, but I don't want it to start a download only to discover, after everything has been transferred, that the file is too large and has to be discarded, wasting bandwidth on malicious requests. So I want to know the size of the content reliably, before the data is actually transferred.
Malicious web servers sending a wrong Content-Length, and similar rare edge cases, don't concern me, as long as it works for the general case.
The worst idea in my mind so far is to simply download the content with a GET request and drop the connection if the transfer exceeds the specified size limit, but that sounds like a very ugly solution for a protocol as general as HTTP.
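For what it's worth, that fallback can at least be implemented cleanly with cURL: a write callback that returns a count different from the chunk length makes cURL abort the transfer immediately, so no more than the limit (plus one chunk) is ever downloaded. A minimal sketch, assuming a hypothetical `downloadWithLimit` helper:

```php
<?php
// Download $url, but abort as soon as more than $maxBytes have arrived.
// Returns the body on success, or null if the limit was exceeded.
function downloadWithLimit($url, $maxBytes) {
    $data = '';
    $tooBig = false;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION,
        function ($ch, $chunk) use (&$data, &$tooBig, $maxBytes) {
            $data .= $chunk;
            if (strlen($data) > $maxBytes) {
                $tooBig = true;
                return 0; // short count => cURL aborts the transfer
            }
            return strlen($chunk); // full count => keep going
        });
    curl_exec($ch);
    curl_close($ch);
    return $tooBig ? null : $data;
}
```

This still opens the connection and receives up to the limit, but it caps the damage from oversized responses without trusting any header.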
Does anyone have any better ideas?
Upvotes: 1
Views: 428
Reputation: 19502
I stumbled upon your question, looking for the same answer. As there's no real answer yet, I've hacked up an implementation for myself. Of course, all the cautions mentioned still apply, and yes, it does use your "ugly" variant - but it's the only way to actually get at the data, provided the information exists.
/**
 * Returns the size reported by the server for the given URL, in bytes.
 *
 * Note this information may not be accurate, or may even be plain wrong.
 *
 * Also note, the return value is deliberately NOT cast to an integer, as
 * the remote file might be bigger than 2^31, which would overflow the
 * number on a 32-bit machine.
 *
 * @throws InvalidArgumentException on unknown URL scheme
 * @throws Exception when unable to connect
 * @param string $url
 * @return string|int the advertised size, or 0 when no Content-Length was sent
 */
function getURLDownloadSize($url) {
    $parts = parse_url($url);
    if ($parts['scheme'] != 'http') {
        throw new \InvalidArgumentException('Scheme not supported');
    }
    $port = isset($parts['port']) ? $parts['port'] : 80;
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    $sock = fsockopen($parts['host'], $port, $errno, $errstr, 3);
    if (!$sock) {
        throw new \Exception(
            sprintf('Unable to connect to host: %s', $errstr)
        );
    }
    stream_set_timeout($sock, 5);
    fwrite($sock, sprintf("GET %s HTTP/1.1\r\n", $path));
    fwrite($sock, sprintf("Host: %s\r\n", $parts['host']));
    fwrite($sock, "Connection: close\r\n");
    fwrite($sock, "\r\n");
    // Read enough to cover the response headers; header names are
    // case-insensitive, hence the /i modifier on the match below.
    $data = fread($sock, 1024 * 20);
    fclose($sock);
    $matchresult = array();
    if (preg_match('/Content-Length:\s+(\d+)/i', $data, $matchresult)) {
        return $matchresult[1];
    }
    return 0;
}
Upvotes: 0
Reputation: 143071
No, servers don't have to tell you the size of the resource they're about to serve, because they may not know it themselves. So no, there's no universal way, but yes, you can look up the Content-Length header whenever it is provided.
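Since the asker is already using cURL, that lookup doesn't need a hand-rolled socket: setting CURLOPT_NOBODY issues a HEAD request, and curl_getinfo exposes the advertised length, or -1 when the server didn't send one. A sketch (the function name is illustrative):

```php
<?php
// HEAD request: returns the advertised size in bytes, or -1 when the
// server did not send a Content-Length header.
function advertisedLength($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo headers
    curl_exec($ch);
    $len = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);
    return $len;
}
```

A -1 result means exactly the situation described above: the header was not provided, and the only honest fallback is to enforce a cap while downloading.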
Upvotes: 3