Decent chunks of this are outputs that can largely be ignored. It is still a bit of a read, but I'm trying to be thorough in my analysis and questioning. If you are familiar with stream_get_meta_data, you would be fine to skip to the "Questions" at the end.
Other than in the docs, I am having trouble finding out much about PHP's stream_get_meta_data. The overall functionality is not vastly different from that of PHP's get_headers, but I cannot for the life of me find any comparisons between the two, or pros/cons of the former.
Up until this point, I've always used PHP's get_headers to verify the validity of a URL. The downside of get_headers is that it is notoriously slow. Understandably, much of the latency is directly due to the server hosting the site of interest, but maybe the method is just overly robust, or something else is slowing it down.
There are plenty of links that recommend using cURL, claiming that it is faster, but I've run side-by-side, timed tests of both, and get_headers has always come out on top, often by a factor of 1.5 or 2.
I've yet to see any solutions using stream_get_meta_data, and only just stumbled upon it for the first time today. I've exhausted my Google skills, without much luck. But, in the interest of optimizing my scheme, I ran some tests.
Comparisons between get_headers and stream_get_meta_data were run using a list of 106 current (i.e. live, valid, status=200) URLs:
Code Block #1
// All URLs in format "http://www.domain.com"
$urls = array('...', '...', '...'); // *106 URLs

// get_headers
$headers = array();
$start = microtime(true);
foreach ($urls as $url) {
    try {
        // get_headers() had no context argument before PHP 7.1, so the default context is switched instead
        stream_context_set_default(array('http' => array('method' => "HEAD")));
        $headers[] = @get_headers($url, 1);
        stream_context_set_default(array('http' => array('method' => "GET")));
    } catch (Exception $e) {
        continue;
    }
}
$end1 = microtime(true) - $start;

// stream_get_meta_data
$streams = array();
$cont = stream_context_create(array('http' => array('method' => "HEAD")));
$start = microtime(true);
foreach ($urls as $url) {
    try {
        $fp = fopen($url, 'rb', false, $cont);
        if (!$fp) {
            continue;
        }
        $streams[] = stream_get_meta_data($fp);
        fclose($fp); // release the connection once the metadata has been captured
    } catch (Exception $e) {
        continue;
    }
}
$end2 = microtime(true) - $start;
And the results I'm getting show stream_get_meta_data coming out on top 90% of the time or more. Sometimes the times are nearly identical, but more often than not stream_get_meta_data has the shorter run time.
Run Times #1
"get_headers": 112.23 // seconds
"stream_get": 42.61 // seconds
With the [stringified] outputs of the two being something like:
Excerpt of Comparison #1
url .. "http://www.wired.com/"
get_headers
| 0 ............................ "HTTP/1.1 200 OK"
| Access-Control-Allow-Origin .. "*"
| Cache-Control ................ "stale-while-revalidate=86400, stale-while-error=86400"
| Content-Type ................. "text/html; charset=UTF-8"
| Link ......................... "; rel=\"https://api.w.org/\""
| Server ....................... "Apache"
| Via
| | "1.1 varnish"
| | "1.1 varnish"
|
| Fastly-Debug-State ........... "HIT"
| Fastly-Debug-Digest .......... "c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
| Content-Length ............... "135495"
| Accept-Ranges ................ "bytes"
| Date ......................... "Tue, 23 Aug 2016 22:32:26 GMT"
| Age .......................... "701"
| Connection ................... "close"
| X-Served-By .................. "cache-jfk8149-JFK, cache-den6024-DEN"
| X-Cache ...................... "HIT, HIT"
| X-Cache-Hits ................. "51, 1"
| X-Timer ...................... "S1471991546.459931,VS0,VE0"
| Vary ......................... "Accept-Encoding"
stream_get
| wrapper_data
| | "HTTP/1.1 200 OK"
| | "Access-Control-Allow-Origin: *"
| | "Cache-Control: stale-while-revalidate=86400, stale-while-error=86400"
| | "Content-Type: text/html; charset=UTF-8"
| | "Link: ; rel=\"https://api.w.org/\""
| | "Server: Apache"
| | "Via: 1.1 varnish"
| | "Fastly-Debug-State: HIT"
| | "Fastly-Debug-Digest: c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
| | "Content-Length: 135495"
| | "Accept-Ranges: bytes"
| | "Date: Tue, 23 Aug 2016 22:32:26 GMT"
| | "Via: 1.1 varnish"
| | "Age: 701"
| | "Connection: close"
| | "X-Served-By: cache-jfk8149-JFK, cache-den6020-DEN"
| | "X-Cache: HIT, HIT"
| | "X-Cache-Hits: 51, 1"
| | "X-Timer: S1471991546.614958,VS0,VE0"
| | "Vary: Accept-Encoding"
|
| wrapper_type ................. "http"
| stream_type .................. "tcp_socket/ssl"
| mode ......................... "rb"
| unread_bytes ................. 0
| seekable ..................... false
| uri .......................... "http://www.wired.com/"
| timed_out .................... false
| blocked ...................... true
| eof .......................... false
For the most part, it is all the same data, with the exception that stream_get_meta_data doesn't offer any way to include keys for wrapper_data without parsing through it manually.
Easy enough...
Code Block #2.1/2.2
$wd = $meta[$url]['wrapper_data'];
$wArr = wrapperToKeys($wd);
where...
function wrapperToKeys($wd) {
    $wArr = array();
    foreach ($wd as $row) {
        // Split on the first colon; the space after it is optional in HTTP
        $pos = strpos($row, ':');
        if ($pos === false) {
            // Status lines such as "HTTP/1.1 200 OK" have no colon and keep numeric keys
            $wArr[] = $row;
        } else {
            // $pos, $key and $value can probably be done with one good preg_match
            $key = substr($row, 0, $pos);
            $value = trim(substr($row, $pos + 1));
            // If key doesn't exist, assign value (isset rather than empty, so a value of "0" is not overwritten)
            if (!isset($wArr[$key])) {
                $wArr[$key] = $value;
            }
            // If key already points to an array, add value to array
            else if (is_array($wArr[$key])) {
                $wArr[$key][] = $value;
            }
            // If key currently points to string, swap value into an array
            else {
                $wArr[$key] = array($wArr[$key], $value);
            }
        }
    }
    return $wArr;
}
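The comment inside wrapperToKeys wonders whether one preg_match could do the splitting. A minimal sketch of that idea follows; the name wrapperToKeysRegex is hypothetical (not part of PHP or of the code above), and it assumes each header line has the shape "Name: value" with optional whitespace after the colon.

```php
<?php
// Hypothetical helper: regex-based variant of wrapperToKeys().
// Assumes every header line looks like "Name:value" or "Name: value".
function wrapperToKeysRegex(array $wd): array
{
    $out = array();
    foreach ($wd as $row) {
        // Capture the field name (no colon or whitespace), skip optional whitespace, capture the value.
        if (!preg_match('/^([^:\s]+):\s*(.*)$/', $row, $m)) {
            // Status lines such as "HTTP/1.1 200 OK" do not match and keep numeric keys.
            $out[] = $row;
            continue;
        }
        $key = $m[1];
        $value = rtrim($m[2]);
        if (!isset($out[$key])) {
            // First occurrence of this header: store the plain string.
            $out[$key] = $value;
        } elseif (is_array($out[$key])) {
            // Third or later occurrence: append to the existing list.
            $out[$key][] = $value;
        } else {
            // Second occurrence: promote the string to a list.
            $out[$key] = array($out[$key], $value);
        }
    }
    return $out;
}
```

Whether this is any faster than the strpos/substr version is untested here; the parsing cost is likely negligible next to the network time either way.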
And the output is identical to get_headers($url, 1):
Excerpt of Comparison #2
url .. "http://www.wired.com/"
headers
| 0 ............................ "HTTP/1.1 200 OK"
| Access-Control-Allow-Origin .. "*"
| Cache-Control ................ "stale-while-revalidate=86400, stale-while-error=86400"
| Content-Type ................. "text/html; charset=UTF-8"
| Link ......................... "; rel=\"https://api.w.org/\""
| Server ....................... "Apache"
| Via
| | "1.1 varnish"
| | "1.1 varnish"
|
| Fastly-Debug-State ........... "HIT"
| Fastly-Debug-Digest .......... "c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
| Content-Length ............... "135495"
| Accept-Ranges ................ "bytes"
| Date ......................... "Tue, 23 Aug 2016 22:35:29 GMT"
| Age .......................... "883"
| Connection ................... "close"
| X-Served-By .................. "cache-jfk8149-JFK, cache-den6027-DEN"
| X-Cache ...................... "HIT, HIT"
| X-Cache-Hits ................. "51, 1"
| X-Timer ...................... "S1471991729.021214,VS0,VE0"
| Vary ......................... "Accept-Encoding"
w-arr
| 0 ............................ "HTTP/1.1 200 OK"
| Access-Control-Allow-Origin .. "*"
| Cache-Control ................ "stale-while-revalidate=86400, stale-while-error=86400"
| Content-Type ................. "text/html; charset=UTF-8"
| Link ......................... "; rel=\"https://api.w.org/\""
| Server ....................... "Apache"
| Via
| | "1.1 varnish"
| | "1.1 varnish"
|
| Fastly-Debug-State ........... "HIT"
| Fastly-Debug-Digest .......... "c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
| Content-Length ............... "135495"
| Accept-Ranges ................ "bytes"
| Date ......................... "Tue, 23 Aug 2016 22:35:29 GMT"
| Age .......................... "884"
| Connection ................... "close"
| X-Served-By .................. "cache-jfk8149-JFK, cache-den6021-DEN"
| X-Cache ...................... "HIT, HIT"
| X-Cache-Hits ................. "51, 1"
| X-Timer ...................... "S1471991729.173641,VS0,VE0"
| Vary ......................... "Accept-Encoding"
Even with sorting out the keys, stream_get_meta_data is the champion:
Sample Run Times #2
"get_headers": 99.51 // seconds
"stream_get": 43.79 // seconds
Note: These tests are being run on a cheap shared server - hence the large variations in testing times. That being said, the gap between the two methods is highly consistent between tests.
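Given the variation between runs on a shared server, one way to get steadier numbers is to repeat each measurement and compare medians rather than single totals. The sketch below is only an illustration: median and medianRunTime are hypothetical helper names, and the callable passed in would wrap one of the loops from Code Block #1.

```php
<?php
// Hypothetical helper: median of a non-empty list of numbers.
function median(array $values): float
{
    if (count($values) === 0) {
        throw new InvalidArgumentException('median() needs at least one value');
    }
    sort($values);
    $n = count($values);
    $mid = intdiv($n, 2);
    // Odd count: middle element. Even count: mean of the two middle elements.
    return $n % 2 ? (float) $values[$mid] : ($values[$mid - 1] + $values[$mid]) / 2.0;
}

// Hypothetical helper: run $fn several times and return the median wall-clock time in seconds.
function medianRunTime(callable $fn, int $runs = 5): float
{
    $times = array();
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $fn();
        $times[] = microtime(true) - $start;
    }
    return median($times);
}
```

The median is used instead of the mean so that a single slow run caused by a noisy neighbour on the server does not dominate the result.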
For those of you who understand the C code behind PHP, and feel you might be able to gain some insight from it, the function definitions for get_headers and stream_get_meta_data can be found in the PHP Git repository.
Questions
How come stream_get_meta_data is so underrepresented (in searches and available code snippets) compared to get_headers?
The way I've worded this leads to opinions, but my intent is more along the lines of: "Is there something so well-known and terrible about stream_get_meta_data that tends to deter people from using it?"
Similar to the previous question, are there well-known, industry agreed-upon pros and cons between the two? The kinds of things that a more comprehensive understanding of CS would reveal. Perhaps get_headers is more secure/robust, and less susceptible to ne'er-do-wells and inconsistencies in server outputs? Or maybe get_headers is known to work in instances where stream_get_meta_data produces an error?
From what I can find, stream_get_meta_data does have a couple of notes and warnings (... for fopen), but nothing so awful that they can't be worked around.
So long as it is safe and consistent, I would like to incorporate it into my project, seeing as this operation is performed often, and cutting the run time in half would make a substantial difference.
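One such workaround for the fopen warnings, sketched under assumptions: by default the http wrapper makes fopen emit a warning and return false when the server answers with a 4xx or 5xx status, and the ignore_errors context option turns that off so the headers can still be read. The helper names makeHeadContext and fetchHeaderLines, the example User-Agent string, and the 10-second timeout are all hypothetical choices, and whether a missing User-Agent is what a given server objects to is only a guess.

```php
<?php
// Hypothetical helper: HEAD-request context that still returns a stream on 4xx/5xx responses.
function makeHeadContext(string $userAgent = 'Mozilla/5.0 (compatible; LinkChecker/1.0)', float $timeout = 10.0)
{
    return stream_context_create(array('http' => array(
        'method'        => 'HEAD',
        'ignore_errors' => true,       // do not fail on 4xx/5xx status codes
        'user_agent'    => $userAgent, // assumption: some servers reject requests that send none
        'timeout'       => $timeout,   // read timeout in seconds
    )));
}

// Hypothetical wrapper: raw header lines for $url, or null if the connection itself failed.
function fetchHeaderLines(string $url, $context): ?array
{
    $fp = @fopen($url, 'rb', false, $context);
    if ($fp === false) {
        return null; // DNS failure, refused connection, timeout, and so on
    }
    $meta = stream_get_meta_data($fp);
    fclose($fp);
    return isset($meta['wrapper_data']) ? $meta['wrapper_data'] : null;
}
```

On PHP 7.1 and later the same context can also be passed as the third argument of get_headers, which avoids switching the default context back and forth.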
I have since found a few URLs that are successful with get_headers but throw a warning for stream_get_meta_data:
PHP Warning: fopen(http://www.alealimay.com/): failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request
PHP Warning: fopen(http://www.thelovelist.net/): failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request
PHP Warning: fopen(http://www.bleedingcool.com/): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden
get_headers returns only the 403 Forbidden status, even though you can paste the URLs into a browser and see they are working sites.
I'm unsure about both behaviors: the failure of stream_get_meta_data, and the incomplete headers from get_headers (which should include all redirects and a final status code of 200 for functioning sites).
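When redirects are followed, the header lines contain one status line per hop, so the redirect chain and the final status can be read from the list directly. A minimal sketch, with hypothetical helper names statusCodes and finalStatusCode; it accepts either wrapper_data or the array returned by get_headers, and skips any non-string entries such as repeated headers grouped into arrays.

```php
<?php
// Hypothetical helper: every HTTP status code found in a list of header lines, in order.
function statusCodes(array $headerLines): array
{
    $codes = array();
    foreach ($headerLines as $line) {
        // Match status lines such as "HTTP/1.1 301 Moved Permanently" or "HTTP/2 200".
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $codes[] = (int) $m[1];
        }
    }
    return $codes;
}

// Hypothetical helper: status code of the last response in the chain, or null if none was found.
function finalStatusCode(array $headerLines): ?int
{
    $codes = statusCodes($headerLines);
    return $codes ? end($codes) : null;
}
```

A URL could then be treated as valid when finalStatusCode returns 200, regardless of how many redirects came before it.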
Much thanks, if you've made it this far.
Also, please comment if you down-vote, so I might be able to improve the question, and we can all learn for future cases.
Upvotes: 3
Views: 1010