Birrel

Reputation: 4834

PHP - Differences between `get_headers` and `stream_get_meta_data`?

Intro / Disclaimer

Sizable chunks of this are outputs that can largely be skimmed. It is still a long read, but I'm trying to be thorough in my analysis and questions. If you are familiar with stream_get_meta_data, feel free to skip to the "Questions" at the end.

Other than in the docs, I am having trouble finding much about PHP's stream_get_meta_data. Its overall functionality is not vastly different from that of PHP's get_headers, but I cannot for the life of me find any comparisons between the two, or pros/cons of the former.

The Setup

Up until this point, I've always used PHP's get_headers to verify the validity of a URL. The downside with get_headers is that it is notoriously slow. Understandably, much of the latency is directly due to the server hosting the site of interest, but maybe the method is just overly robust, or something else is slowing it down.

There are plenty of links that recommend using cURL, claiming that it is faster, but I've run side-by-side timed tests of both, and get_headers has always come out on top, often by a factor of 1.5 or 2.

I've yet to see any solutions using stream_get_meta_data, and only just stumbled upon it for the first time today. I've exhausted my Google skills, without much luck. But, in the interest of optimizing my scheme, I ran some tests.

The Testing

Comparisons between get_headers and stream_get_meta_data were run using a list of 106 current (i.e. live, valid, status=200) URLs:

Code Block #1

// All URLs in format "http://www.domain.com"
$urls = array('...', '...', '...'); // *106 URLs

// get_headers
$headers = array();
$start = microtime(true);
foreach($urls as $url) {
    try{
        // Before PHP 7.1, get_headers does not accept a context argument,
        // so the default context is switched to HEAD and back instead
        stream_context_set_default(array('http' => array('method' => "HEAD")));
        $headers[] = @get_headers($url, 1);
        stream_context_set_default(array('http' => array('method' => "GET")));

    }catch(Exception $e){
        continue;
    }
}
$end1 = microtime(true) - $start;

// stream_get_meta_data
$streams = array();
$cont = stream_context_create(array('http' => array('method' => "HEAD")));
$start = microtime(true);
foreach($urls as $url) {
    try{
        $fp = fopen($url, 'rb', false, $cont);
        if(!$fp) {
            continue;
        }
        $streams[] = stream_get_meta_data($fp);
        fclose($fp); // Release the connection before the next request

    }catch(Exception $e){
        continue;
    }
}
$end2 = microtime(true) - $start;
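
As a side note on Code Block #1: from PHP 7.1 onward, get_headers accepts a stream context as its third parameter, so on newer versions the default context does not have to be switched back and forth. A minimal sketch, assuming PHP 7.1 or later; the helper names headContext and headHeaders are hypothetical, and the 5-second timeout is an arbitrary choice that was not part of the tests above:

```php
<?php
// Hypothetical helper: builds a context that sends HEAD requests.
// The timeout is an assumption for illustration only.
function headContext($timeout = 5.0) {
    return stream_context_create(array(
        'http' => array(
            'method'  => 'HEAD',
            'timeout' => $timeout,
        ),
    ));
}

// Hypothetical helper: passes the context per call (requires PHP >= 7.1).
// Returns the associative header array, or false on failure.
function headHeaders($url, $context) {
    return @get_headers($url, 1, $context);
}

$context = headContext();
// Network call left commented out; the URL is a placeholder:
// $headers = headHeaders('http://www.domain.com', $context);
```

This keeps the HEAD setting local to the call, so other code relying on the default context is not affected.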

The results show stream_get_meta_data coming out on top 90% of the time or more. Sometimes the times are nearly identical, but more often than not stream_get_meta_data has the shorter run time.

Run Times #1

"get_headers": 112.23 // seconds
"stream_get":  42.61 // seconds

With the [stringified] outputs of the two being something like:

Excerpt of Comparison #1

url  ..  "http://www.wired.com/"

get_headers
|    0  ............................  "HTTP/1.1 200 OK"
|    Access-Control-Allow-Origin  ..  "*"
|    Cache-Control  ................  "stale-while-revalidate=86400, stale-while-error=86400"
|    Content-Type  .................  "text/html; charset=UTF-8"
|    Link  .........................  "; rel=\"https://api.w.org/\""
|    Server  .......................  "Apache"
|    Via
|    |    "1.1 varnish"
|    |    "1.1 varnish"
|    
|    Fastly-Debug-State  ...........  "HIT"
|    Fastly-Debug-Digest  ..........  "c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
|    Content-Length  ...............  "135495"
|    Accept-Ranges  ................  "bytes"
|    Date  .........................  "Tue, 23 Aug 2016 22:32:26 GMT"
|    Age  ..........................  "701"
|    Connection  ...................  "close"
|    X-Served-By  ..................  "cache-jfk8149-JFK, cache-den6024-DEN"
|    X-Cache  ......................  "HIT, HIT"
|    X-Cache-Hits  .................  "51, 1"
|    X-Timer  ......................  "S1471991546.459931,VS0,VE0"
|    Vary  .........................  "Accept-Encoding"

stream_get
|    wrapper_data
|    |    "HTTP/1.1 200 OK"
|    |    "Access-Control-Allow-Origin: *"
|    |    "Cache-Control: stale-while-revalidate=86400, stale-while-error=86400"
|    |    "Content-Type: text/html; charset=UTF-8"
|    |    "Link: ; rel=\"https://api.w.org/\""
|    |    "Server: Apache"
|    |    "Via: 1.1 varnish"
|    |    "Fastly-Debug-State: HIT"
|    |    "Fastly-Debug-Digest: c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
|    |    "Content-Length: 135495"
|    |    "Accept-Ranges: bytes"
|    |    "Date: Tue, 23 Aug 2016 22:32:26 GMT"
|    |    "Via: 1.1 varnish"
|    |    "Age: 701"
|    |    "Connection: close"
|    |    "X-Served-By: cache-jfk8149-JFK, cache-den6020-DEN"
|    |    "X-Cache: HIT, HIT"
|    |    "X-Cache-Hits: 51, 1"
|    |    "X-Timer: S1471991546.614958,VS0,VE0"
|    |    "Vary: Accept-Encoding"
|    
|    wrapper_type  .................  "http"
|    stream_type  ..................  "tcp_socket/ssl"
|    mode  .........................  "rb"
|    unread_bytes  .................  0
|    seekable  .....................  false
|    uri  ..........................  "http://www.wired.com/"
|    timed_out  ....................  false
|    blocked  ......................  true
|    eof  ..........................  false

For the most part the data is the same, except that stream_get_meta_data offers no way to key the wrapper_data entries by header name without parsing them manually.

Easy enough...

Code Block #2.1/2.2

$wd = $meta[$url]['wrapper_data'];
$wArr = wrapperToKeys($wd);

where...

function wrapperToKeys($wd) {
    $wArr = array();
    foreach($wd as $row) {
        $pos = strpos($row, ':'); // Split on the first colon; the space after it is optional in HTTP

        if($pos === false) {
            // Lines without a colon, such as the status line "HTTP/1.1 200 OK"
            $wArr[] = $row;
        }else {
            $key = substr($row, 0, $pos);
            $value = trim(substr($row, $pos + 1));

            // If key doesn't exist, assign value
            // (isset instead of empty, so a stored value of "0" is not overwritten)
            if(!isset($wArr[$key])) {
                $wArr[$key] = $value;
            }

            // If key already points to an array, add value to array
            else if(is_array($wArr[$key])) {    
                $wArr[$key][] = $value;
            }

            // If key currently points to string, swap value into an array
            else {                          
                $wArr[$key] = array($wArr[$key], $value);
            }
        }
    }
    
    return $wArr;
}

And the output is identical to get_headers($url, 1):

Excerpt of Comparison #2

url  ..  "http://www.wired.com/"

headers
|    0  ............................  "HTTP/1.1 200 OK"
|    Access-Control-Allow-Origin  ..  "*"
|    Cache-Control  ................  "stale-while-revalidate=86400, stale-while-error=86400"
|    Content-Type  .................  "text/html; charset=UTF-8"
|    Link  .........................  "; rel=\"https://api.w.org/\""
|    Server  .......................  "Apache"
|    Via
|    |    "1.1 varnish"
|    |    "1.1 varnish"
|    
|    Fastly-Debug-State  ...........  "HIT"
|    Fastly-Debug-Digest  ..........  "c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
|    Content-Length  ...............  "135495"
|    Accept-Ranges  ................  "bytes"
|    Date  .........................  "Tue, 23 Aug 2016 22:35:29 GMT"
|    Age  ..........................  "883"
|    Connection  ...................  "close"
|    X-Served-By  ..................  "cache-jfk8149-JFK, cache-den6027-DEN"
|    X-Cache  ......................  "HIT, HIT"
|    X-Cache-Hits  .................  "51, 1"
|    X-Timer  ......................  "S1471991729.021214,VS0,VE0"
|    Vary  .........................  "Accept-Encoding"

w-arr
|    0  ............................  "HTTP/1.1 200 OK"
|    Access-Control-Allow-Origin  ..  "*"
|    Cache-Control  ................  "stale-while-revalidate=86400, stale-while-error=86400"
|    Content-Type  .................  "text/html; charset=UTF-8"
|    Link  .........................  "; rel=\"https://api.w.org/\""
|    Server  .......................  "Apache"
|    Via
|    |    "1.1 varnish"
|    |    "1.1 varnish"
|    
|    Fastly-Debug-State  ...........  "HIT"
|    Fastly-Debug-Digest  ..........  "c245efbf14778c681ce317da114c1a762199e1326323d07b531d765e97fc8695"
|    Content-Length  ...............  "135495"
|    Accept-Ranges  ................  "bytes"
|    Date  .........................  "Tue, 23 Aug 2016 22:35:29 GMT"
|    Age  ..........................  "884"
|    Connection  ...................  "close"
|    X-Served-By  ..................  "cache-jfk8149-JFK, cache-den6021-DEN"
|    X-Cache  ......................  "HIT, HIT"
|    X-Cache-Hits  .................  "51, 1"
|    X-Timer  ......................  "S1471991729.173641,VS0,VE0"
|    Vary  .........................  "Accept-Encoding"

Even with sorting out the keys, stream_get_meta_data is the champion:

Sample Run Times #2

"get_headers": 99.51 // seconds
"stream_get": 43.79 // seconds

Note: These tests are being run on a cheap shared server - hence the large variations in testing times. That being said, the gap between the two methods is highly consistent between tests.
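
Given the noise on a shared server, one way to make the comparison less sensitive to a single slow run is to time each approach several times and compare medians rather than one total. A rough sketch; medianRunTime is a hypothetical helper, and the default of 5 runs is an arbitrary assumption:

```php
<?php
// Hypothetical helper: runs $fn $runs times and returns the median
// wall-clock duration in seconds.
function medianRunTime(callable $fn, $runs = 5) {
    $runs = max(1, (int) $runs);
    $times = array();
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $fn();
        $times[] = microtime(true) - $start;
    }
    sort($times);
    $mid = (int) floor(count($times) / 2);
    // Even number of runs: average the two middle values
    if (count($times) % 2 === 0) {
        return ($times[$mid - 1] + $times[$mid]) / 2;
    }
    return $times[$mid];
}

// Usage sketch (network calls left commented out):
// $t1 = medianRunTime(function () use ($urls) {
//     foreach ($urls as $url) { @get_headers($url, 1); }
// });
```

The same helper can wrap the fopen/stream_get_meta_data loop, so both approaches are measured in the same way.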

Additional

For those of you who understand the C code behind PHP and feel you might be able to gain some insight from it, the function definitions can be found at:

'get_headers' (PHP Git)

and

'stream_get_meta_data' (PHP Git)

Questions

  1. How come stream_get_meta_data is so underrepresented (in searches and available code snippets) compared to get_headers?

    The way I've worded this leads to opinions, but my intent is more along the lines of: "Is there something so well-known and terrible about stream_get_meta_data that tends to deter people from using it?"

  2. Similar to the previous, are there well-known, industry-agreed-upon pros and cons between the two? The kinds of things that a more comprehensive understanding of CS would reveal. Perhaps get_headers is more secure/robust, and less susceptible to ne'er-do-wells and inconsistencies in server outputs? Or maybe get_headers is known to work in instances where stream_get_meta_data produces an error?

    From what I can find, stream_get_meta_data does have a couple notes and warnings (... for fopen), but nothing so awful that they can't be worked around.

So long as it is safe and consistent, I would like to incorporate it into my project, seeing as this operation is performed often, and cutting the run time in half would make a substantial difference.

Edit #1

I have since found a few URLs that succeed with get_headers but throw a warning with the stream_get_meta_data approach:

PHP Warning:  fopen(http://www.alealimay.com/): failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request

PHP Warning:  fopen(http://www.thelovelist.net/): failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request

PHP Warning:  fopen(http://www.bleedingcool.com/): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden

get_headers returns only the 403 Forbidden status, even though you can paste the URLs into a browser and see they are working sites.

I am unsure about both behaviors: the failure of the stream_get_meta_data approach, and the incomplete header list from get_headers (which should include all redirects and a final status code of 200 for functioning sites).
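
On the fopen warnings: by default the http wrapper treats a failure status code as a failed open, so no stream (and therefore no meta data) comes back. The documented `ignore_errors` context option makes the wrapper return the stream regardless of status code, which keeps the headers available in wrapper_data. It does not explain why these servers answer 400 or 403 in the first place. When redirects are followed, wrapper_data holds one status line per hop, so the final status is the last line starting with "HTTP/". A minimal sketch, not tested against the URLs above; lenientHeadContext and finalStatusCode are hypothetical names, and the sample header lines are made up:

```php
<?php
// Hypothetical helper: HEAD context that does not treat 4xx/5xx as a failed open.
function lenientHeadContext() {
    return stream_context_create(array(
        'http' => array(
            'method'        => 'HEAD',
            'ignore_errors' => true,
        ),
    ));
}

// Hypothetical helper: returns the status code of the last status line
// in a wrapper_data-style list of header lines, or null if none is found.
function finalStatusCode(array $lines) {
    $code = null;
    foreach ($lines as $line) {
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Made-up sample in the shape wrapper_data uses (redirect, then final response)
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://www.domain.com/',
    'HTTP/1.1 403 Forbidden',
    'Connection: close',
);
echo finalStatusCode($sample), "\n"; // prints 403

// With a live URL (left commented out):
// $fp = fopen($url, 'rb', false, lenientHeadContext());
// if ($fp) {
//     $meta = stream_get_meta_data($fp);
//     fclose($fp);
//     $status = finalStatusCode($meta['wrapper_data']);
// }
```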


Much thanks, if you've made it this far.

Also, please comment if you down-vote, so I might be able to improve the question, and we can all learn for future cases.

Upvotes: 3

Views: 1010

Answers (0)
