Tom

Reputation: 53

How would you programmatically find out the total number of HTTP requests for a given URL in PHP?

Is there an easy way to do this without parsing the entire resource pointed to by the URL and finding out the different content types (images, javascript files, etc.) linked to inside that URL?

Upvotes: 3

Views: 2450

Answers (2)

DaveRandom

Reputation: 88707

EDIT

This is easily possible using PhantomJS, which is a lot closer to the right tool for the job than PHP.
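For what it's worth, a minimal sketch of driving PhantomJS from PHP might look like the snippet below. The script name count_requests.js and its contents are hypothetical - the PhantomJS side would register page.onResourceRequested, count each request, and print the total once the page has loaded:

    <?php
    // Sketch only: let PhantomJS load the page and count every request it
    // makes. count_requests.js is a hypothetical script that would use the
    // page.onResourceRequested callback to tally requests and print a total.

    $url   = 'http://www.example.com/';
    $count = shell_exec('phantomjs count_requests.js ' . escapeshellarg($url));

    echo 'Total HTTP requests: ' . trim($count);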


Original Answer (slightly modified)

To do this effectively would take so much work I doubt it's worth the bother.

The way I see it, you would have to use something like DOMDocument::loadHTML() to parse an HTML document, look for all the src= and href= attributes, and parse them (a rough sketch of that basic step follows the list below). Sounds relatively simple, I know, but there are several thousand potential tripping points. Here are a few off the top of my head:

  • Firstly, you will have to check that the initial requested resource actually is an HTML document. This might be as simple as looking at the Content-Type: header of the response, but if the server doesn't behave correctly in this respect, you could get the wrong answer.
  • You would have to check for duplicated resources (like repeated images) that may not be specified in the same manner - e.g. if the document you are reading from example.com is at /dir1/dir2/doc.html and it uses an image at /dir1/dir3/img.gif, in some places this might be referred to as /dir1/dir3/img.gif, in others as http://www.example.com/dir1/dir3/img.gif, and in others as ../dir3/img.gif - you would have to recognise that all of these are one resource and would only result in one request.
  • You would have to watch out for browser specific stuff (like <!--[if IE]) and decide whether you wanted to include resources included in these blocks in the total count. This would also present a new problem with using the XML parser, since <!--[if IE] blocks are technically valid SGML comments and would be ignored.
  • You would have to parse any CSS docs and look for resources that are included with CSS declarations (like background-image:, for example). These resources would also have to be checked against the src/hrefs in the initial document for duplication.
  • Here is the really difficult one - you would have to look for resources dynamically added to the document on load via Javascript. For example, one of the ways you can use Google AdWords is with a neat little bit of JS that dynamically adds a new <script> element to the document, in order to get the actual script from Google. In order to do this, you would have to effectively evaluate and execute the Javascript on the page to see if it generates any new requests.
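For illustration, here is a minimal sketch of that basic DOMDocument step. The function name is mine, and the URL handling is deliberately naive - it ignores CSS-referenced resources, conditional comments, JS-added resources and proper relative-path resolution, i.e. most of the caveats above:

    <?php
    function countLinkedResources($url)
    {
        // Fetch the document (no error handling or Content-Type check here,
        // although as noted above you really should verify it is HTML).
        $html = file_get_contents($url);
        if ($html === false) {
            return 0;
        }

        $doc = new DOMDocument();
        libxml_use_internal_errors(true); // real-world HTML is rarely valid
        $doc->loadHTML($html);
        libxml_clear_errors();

        $xpath     = new DOMXPath($doc);
        $resources = array();

        // src attributes on images and scripts, href on stylesheets
        foreach ($xpath->query('//img[@src] | //script[@src]') as $node) {
            $resources[] = $node->getAttribute('src');
        }
        foreach ($xpath->query('//link[@rel="stylesheet"][@href]') as $node) {
            $resources[] = $node->getAttribute('href');
        }

        // Deliberately naive normalisation so that /a/b.gif and
        // http://host/a/b.gif count once; ../relative paths are NOT
        // resolved correctly - see the caveats above.
        $base   = parse_url($url);
        $unique = array();
        foreach ($resources as $res) {
            if (parse_url($res, PHP_URL_HOST) === null) {
                $res = $base['scheme'] . '://' . $base['host'] . '/' . ltrim($res, '/');
            }
            $unique[$res] = true;
        }

        // +1 for the initial HTML document itself
        return count($unique) + 1;
    }

    echo countLinkedResources('http://www.example.com/dir1/dir2/doc.html');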

So you see, this would not be easy. I suspect it may actually be easier to go get the source of a browser and modify it. If you want to try to build a PHP-based solution that comes up with an accurate answer, be my guest (you might even be able to sell something as complicated as that), but honestly, ask yourself this - do I really have that much time on my hands?

Upvotes: 3

Gavin

Reputation: 2183

Just some quick thoughts for you.

  1. You should be aware that caching - and the differences in how browsers obey and disobey caching directives - can lead to different resource requests being generated for the same page by different browsers at different times, so this might be worth considering.

  2. If the purpose of your project is simply to measure this metric and you have control over the website in question, you can pass every resource through a PHP proxy that counts the requests, i.e. you can follow this pattern for SSI, scripts, styles, fonts - anything (a sketch of such a proxy follows this list).

  3. If point 2 is not possible due to the nature of your website but you have server access, then how about parsing the HTTP access log? I would imagine this would be simple compared with trying to parse an HTML/PHP file, but it could be very slow (a log-parsing sketch also follows this list).

  4. If you don't have access to the website source or the HTTP logs, then I doubt you could do this with any real accuracy - there is a huge amount of work involved - but you could use cURL to fetch the initial HTML and then parse it as per the instructions by DaveRandom.
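To illustrate point 2, a counting proxy could be as small as the sketch below. The file name proxy.php, the log destination and the content-type map are just assumptions; the idea is that the page is rewritten so that e.g. <img src="proxy.php?resource=img/logo.gif"> is requested instead of the image directly:

    <?php
    // proxy.php - sketch of a counting pass-through for same-site resources.
    // Every <img>, <script>, <link> etc. on the page is rewritten to point
    // at proxy.php?resource=..., so each hit here equals one resource request.

    $resource = isset($_GET['resource']) ? $_GET['resource'] : '';

    // Restrict to files inside the web root to avoid directory traversal.
    $path = realpath(__DIR__ . '/' . $resource);
    if ($path === false || strpos($path, __DIR__) !== 0) {
        header('HTTP/1.1 404 Not Found');
        exit;
    }

    // Count the request (one line per hit; aggregate later however you like).
    file_put_contents(__DIR__ . '/request_count.log', $resource . "\n", FILE_APPEND);

    // Serve the resource with a best-guess content type.
    $types = array(
        'css' => 'text/css',
        'js'  => 'application/javascript',
        'gif' => 'image/gif',
        'png' => 'image/png',
        'jpg' => 'image/jpeg',
    );
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    header('Content-Type: ' . (isset($types[$ext]) ? $types[$ext] : 'application/octet-stream'));
    readfile($path);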
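And to illustrate point 3, counting from an Apache-style access log could look roughly like this - the log path, the "combined" log format and the page URL are assumptions, and matching on the Referer field is only as reliable as the clients that send it:

    <?php
    // Sketch: from an Apache "combined" access log, collect the distinct
    // resources requested with the page as their Referer, then add one for
    // the page itself.

    $logFile = '/var/log/apache2/access.log';
    $pageUrl = 'http://www.example.com/dir1/dir2/doc.html';

    $resources = array();
    foreach (new SplFileObject($logFile) as $line) {
        // Combined format: ... "GET /path HTTP/1.1" status bytes "referer" "agent"
        if (strpos($line, '"' . $pageUrl . '"') !== false
            && preg_match('#"(?:GET|POST) (\S+) HTTP/#', $line, $m)) {
            $resources[$m[1]] = true;
        }
    }

    echo count($resources) + 1; // +1 for the HTML document itself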

I hope something in this is helpful for you.

Upvotes: 4
