Reputation: 13527
If I download a file from a website using:
$html = file_get_html($url);
Then how can I know the size, in kilobyes, of the HTML string? I want to know, because I want to skip files over 100Kb.
Upvotes: 2
Views: 338
Reputation: 46816
To skip fetching large files, you want to use the cURL library.
<?php
function get_content_length($url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
$hraw=explode("\r\n",curl_exec($ch));
curl_close($ch);
$hdrs=array();
foreach($hraw as $hdr) {
$a=explode(": ", trim($hdr));
$hdrs[$a[0]]=$a[1];
}
return (isset($hdrs['Content-Length'])) ? $hdrs['Content-Length'] : FALSE;
}
$url="http://www.example.com/";
if (get_content_length($url) < 100000) {
$html = file_get_contents($url);
print "Yes.\n";
} else {
print "No.\n";
}
?>
There may be a more elegant way to pull this information out of curl, but this is what came to mind fastest. YMMV.
Note that setting the CURLOPT options this way makes curl use a "HEAD" rather than "GET" request, so we're not actually fetching this URL twice.
Upvotes: 3
Reputation: 13975
You can use mb_strlen to force 8bit or what not and then 1 character = 1 byte
Upvotes: 0
Reputation: 521995
file_get_html
returns an object to you, the information of how big the string is is lost at that point. Get the string first, the object later:
$html = file_get_contents($url);
echo strlen($html); // size in bytes
$html = str_get_html($html);
Upvotes: 0
Reputation: 65244
The definition, what a string is, is different between PHP and the intuitive meaning:
"Hällo" (mind the Umlaut) looks like a 5-character String, but to PHP it is really a 6-byte array (assuming UTF8) - PHP doesn't have a notion of a String representing text, it just sees it as a sequence of bytes (The PHP euphemism is "binary safe").
So strlen("Hällo") will be 6 (UTF8).
That said, if you want to skip above 100Kb you probably won't mind if it is 99.5k characters translating to 100k bytes.
Upvotes: 2
Reputation: 98459
If you do file_get_contents
, you've already gotten the whole file.
If you mean "skip processing", rather than "skip retrieval", you can just get the length of the string: strlen($html)
. For kilobytes, divide that by 1024.
This is imprecise because the string may contain UTF-8 characters over one byte in length, and very small files will actually occupy a FS block instead of their byte length, but it's probably good enough for the arbitrary-threshold cutoff you're looking for.
Upvotes: 4