Reputation: 233
I'm currently working on a program that pulls back various data from a given URL. For this I have a function that combs through the URL and builds an array of all of the locations from the source code. This works perfectly, and I've managed to filter the pages to check whether they're files, not on the page, etc.
My trouble is that I have tested this on some sites that have a menu with sub-pages under the menu options. The main option across the navigation bar will have a page value, and the first option on the sub-navigation will be the same page but with a value on the end of the URL (primarily to toggle between JavaScript). I have tried encoding the page and comparing it (to shorten processing time); however, on some of the sites the URL is put into a form field.
Example:
Option1 - www.example.com/page1
- first opt - www.example.com/page1?t=1
- second opt - www.example.com/page1?t=2
From what I can see, it won't be possible to strip off the additional values, as some sites rely solely on these values whereas other pages use JS. As the URLs are technically different, is there a way to check whether the pages are the same even though they are on different URLs?
Upvotes: 1
Views: 1597
Reputation: 2275
In your situation I would suggest retrieving only the headers and comparing the Content-Length values.
function content_length($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, TRUE);
    // Request the headers only; the body is not downloaded.
    curl_setopt($ch, CURLOPT_NOBODY, TRUE);
    $data = curl_exec($ch);
    // This is -1 when the server does not report a Content-Length.
    $size = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);
    return $size;
}
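function filterURL($url) {
    // Make the URLs as similar as possible, e.g. strip everything after the hash symbol.
    $pos = strpos($url, "#");
    // Without this check, a URL with no "#" would be cut down to an empty string.
    return $pos === false ? $url : substr($url, 0, $pos);
}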
$url1 = 'http://example.com/page/?foo=1#bar';
$url2 = 'http://example.com/page/?foo=2#bar2';
if (content_length(filterURL($url1)) == content_length(filterURL($url2))) {
    print "Same";
} else {
    print "Different";
    doWhatYouNeedToDo();
}
This does not guarantee that the pages are the same or different, but it does not require you to download the whole page.
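If the lengths match (or the server reports no Content-Length at all), a stronger but more expensive follow-up check is to download both bodies and compare a hash of each. The following is only a rough sketch of that idea: page_fingerprint, same_body and fetch_body are hypothetical helper names, and collapsing whitespace before hashing is my own assumption about what should count as "the same page".

```php
<?php
// Hypothetical helper: reduce a response body to a short fingerprint.
// Assumption: runs of whitespace are not significant, so they are collapsed.
function page_fingerprint(string $body): string {
    $normalized = trim(preg_replace('/\s+/', ' ', $body));
    return sha1($normalized);
}

// Hypothetical helper: true when two already-downloaded bodies have the same fingerprint.
function same_body(string $bodyA, string $bodyB): bool {
    return page_fingerprint($bodyA) === page_fingerprint($bodyB);
}

// Hypothetical helper: download the full body with cURL; returns null on failure.
function fetch_body(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

You would call same_body(fetch_body($url1), fetch_body($url2)) only for the pairs that the header check could not tell apart, after checking that neither fetch returned null. Pages that embed something dynamic (timestamps, tokens) will still hash differently even when they are effectively the same page.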
Upvotes: 2