Reputation: 53
Is there an easy way to do this without parsing the entire resource pointed to by the URL and finding out the different content types (images, javascript files, etc.) linked to inside that URL?
Upvotes: 3
Views: 2450
Reputation: 88707
EDIT
This is easily possible using PhantomJS, which is a lot closer to the right tool for the job than PHP.
Original Answer (slightly modified)
To do this effectively would take so much work I doubt it's worth the bother.
The way I see it, you would have to use something like DOMDocument::loadHTML() to parse the HTML document, and look for all the src= and href= attributes and parse them. Sounds relatively simple, I know, but there are several thousand potential tripping points. Here are a few off the top of my head:
- To determine the type of each linked resource, you would have to actually request it and inspect the Content-Type: header of the response - but if the server doesn't behave correctly in this respect, you could get the wrong answer.
- You would have to resolve relative URLs and recognise duplicates. If a document on example.com is at /dir1/dir2/doc.html and it uses an image /dir1/dir3/img.gif, in some places in the document this might be referred to as /dir1/dir3/img.gif, in some places it might be http://www.example.com/dir1/dir3/img.gif and in some places it might be ../dir3/img.gif - you would have to recognise that this is one resource and would only result in one request.
- You would have to detect conditional comments (<!--[if IE]>) and decide whether you wanted to include resources referenced in these blocks in the total count. This would also present a new problem with using an XML parser, since <!--[if IE]> blocks are technically valid SGML comments and would be ignored.
- You would have to parse any CSS and look for resources it references (via background-image:, for example). These resources would also have to be checked against the src/hrefs in the initial document for duplication.
- Some scripts (Google Analytics, for example) work by dynamically appending a new <script> element to the document, in order to get the actual script from Google. In order to account for these, you would have to effectively evaluate and execute the Javascript on the page to see if it generates any new requests.

So you see, this would not be easy. I suspect it may actually be easier to go get the source of a browser and modify it. If you want to try and come up with a PHP-based solution that comes up with an accurate answer, be my guest (you might even be able to sell something as complicated as that), but honestly, ask yourself this - do I really have that much time on my hands?
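For the duplicate-URL point above, the normalisation step might look something like this (a Python sketch using only the standard library; the example URLs are the ones from the answer):

```python
from urllib.parse import urljoin, urlsplit

def normalise(base_url, reference):
    """Resolve a reference against the document URL and strip the
    fragment, so different spellings of one resource compare equal."""
    resolved = urljoin(base_url, reference)
    parts = urlsplit(resolved)
    # Rebuild with a lowercased host and no fragment.
    return parts._replace(netloc=parts.netloc.lower(), fragment="").geturl()

base = "http://www.example.com/dir1/dir2/doc.html"
spellings = [
    "/dir1/dir3/img.gif",
    "http://www.example.com/dir1/dir3/img.gif",
    "../dir3/img.gif",
]
# All three spellings collapse to a single resource -> one request.
unique = {normalise(base, s) for s in spellings}
```

A set of normalised URLs like this is what the counter would actually iterate over when issuing requests.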
Upvotes: 3
Reputation: 2183
Just some quick thoughts for you.
You should be aware that caching, and the differences in the ways browsers obey and disobey caching directives, can lead to different resource requests being generated for the same page by different browsers at different times - this might be worth considering.
If the purpose of your project is simply to measure this metric and you have control over the website in question, you can pass every resource through a PHP proxy which counts the requests. You can follow this pattern for SSIs, scripts, styles, fonts - anything.
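The counting side of that proxy pattern can be sketched as follows (in Python rather than PHP, with hypothetical names; a real proxy would also persist the counter and stream the requested file back to the client):

```python
import collections
import mimetypes

# Hypothetical in-memory counter; a real proxy would persist this
# (in a database or log file) before serving the file.
request_counts = collections.Counter()

def serve_resource(page, resource_path):
    """Record that loading `page` caused a request for `resource_path`."""
    kind, _ = mimetypes.guess_type(resource_path)
    request_counts[(page, kind or "unknown")] += 1
    # ... here the real proxy would read the file and echo it back ...

# Simulate one page load pulling in a stylesheet, a script and two images.
for res in ["/css/main.css", "/js/app.js", "/img/a.gif", "/img/b.png"]:
    serve_resource("/index.html", res)

total = sum(request_counts.values())  # 4 requests for this page load
```

Keying the counter by (page, type) also gives you the per-content-type breakdown the question asks about for free.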
If point 2 is not possible due to the nature of your website but you have access to the server, then how about parsing the HTTP log? I would imagine this will be simple compared with trying to parse an HTML/PHP file, but could be very slow.
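Counting requests out of a common-log-format access log is fairly mechanical. A sketch with made-up sample lines (a real log would also need the Referer field to tie resource requests back to the page that triggered them):

```python
import re

# Two made-up access-log lines in Common Log Format.
LOG = '''\
127.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
127.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /img/logo.gif HTTP/1.1" 200 512
'''

LINE = re.compile(r'"(?P<method>\w+) (?P<path>\S+) [^"]+" (?P<status>\d+)')

def count_requests(log_text):
    """Count successful GET requests per path."""
    counts = {}
    for line in log_text.splitlines():
        m = LINE.search(line)
        if m and m.group("method") == "GET" and m.group("status") == "200":
            path = m.group("path")
            counts[path] = counts.get(path, 0) + 1
    return counts

counts = count_requests(LOG)  # {'/index.html': 1, '/img/logo.gif': 1}
```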
If you don't have access to the website source / HTTP logs, then I doubt you could do this with any real accuracy - there is a huge amount of work involved - but you could use cURL to fetch the initial HTML and then parse it as per the instructions by DaveRandom.
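The first step of that approach - collecting src/href values from the fetched HTML - can be sketched with a standard-library parser (Python here for illustration; the HTML snippet is made up, and real pages would hit every tripping point DaveRandom lists):

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect src= and href= attribute values from a document."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.resources.append(value)

HTML = '''<html><head><link href="/css/main.css" rel="stylesheet">
<script src="/js/app.js"></script></head>
<body><img src="/img/logo.gif"><a href="/about.html">About</a></body></html>'''

collector = ResourceCollector()
collector.feed(HTML)
# Note: an <a href> is navigation, not a sub-resource, so a real
# counter would filter by tag name as well as attribute name.
```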
I hope something in this is helpful for you.
Upvotes: 4