Reputation: 10832
I am using PHP.
Given two URLs like http://soccernet.com and http://soccernet.espn.go.com/index?cc=4716, how can I tell that they actually point to the same page?
Also consider the case where the only difference is the scheme, such as https://gmail.com and http://gmail.com.
Please advise. I am struggling with regex because it cannot handle cases like the soccernet example, where the two URLs look nothing alike.
I am open to any good ideas and am not limiting myself to regex.
Edit: Thanks for all the comments and answers below. What would be a good way to reach some level of certainty? What factors should I look for, and how do I check them most efficiently?
Upvotes: 0
Views: 187
Reputation: 1449
You can do an HTTP HEAD request to determine whether the page redirects somewhere else. You could compare the actual response bodies, but with a website like ESPN even the same URL will rarely respond with identical contents, due to tracking JavaScript and ads.
Use the get_headers() function and recursively follow the 'Location' key. So 'soccernet.com' redirects to 'http://soccernet.espn.go.com/archive/', which redirects to 'http://soccernet.espn.go.com/index'. Ignoring the query string, this URL and the other URL you have are equivalent.
print_r(get_headers('http://soccernet.espn.go.com/archive/', 1));
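A minimal sketch of that idea, not a definitive implementation: follow 'Location' headers until they stop, then compare host and path while ignoring the scheme and query string (which also covers the http/https case). The function names resolveFinalUrl(), urlKey() and sameAfterRedirects() are hypothetical, and the $fetchHeaders parameter is an assumption added so the logic can be tried without network access. It defaults to get_headers($url, 1).

```php
<?php
// Hypothetical helpers; none of these names come from PHP itself.

/**
 * Follow 'Location' headers until no redirect is left or $maxHops is reached.
 * $fetchHeaders must return an associative array like get_headers($url, 1).
 */
function resolveFinalUrl(string $url, ?callable $fetchHeaders = null, int $maxHops = 10): string
{
    $fetchHeaders = $fetchHeaders ?? function (string $u) {
        return @get_headers($u, 1);
    };

    for ($hop = 0; $hop < $maxHops; $hop++) {
        $headers = $fetchHeaders($url);
        if (!is_array($headers)) {
            break; // request failed; keep the last URL we had
        }
        // Header names are case-insensitive.
        $headers = array_change_key_case($headers, CASE_LOWER);
        if (empty($headers['location'])) {
            break; // no redirect: this is the final URL
        }
        $location = $headers['location'];
        // get_headers() may follow redirects itself and return every Location as an array.
        if (is_array($location)) {
            $location = end($location);
        }
        // Rough handling of relative redirects: resolve against the current host.
        if (!parse_url($location, PHP_URL_HOST)) {
            $base = parse_url($url);
            $location = ($base['scheme'] ?? 'http') . '://' . ($base['host'] ?? '')
                . '/' . ltrim($location, '/');
        }
        if ($location === $url) {
            break; // redirect loop
        }
        $url = $location;
    }
    return $url;
}

/** Comparison key: lowercase host plus path, ignoring scheme, query string and trailing slash. */
function urlKey(string $url): string
{
    $parts = parse_url($url) ?: [];
    return strtolower($parts['host'] ?? '') . rtrim($parts['path'] ?? '', '/');
}

function sameAfterRedirects(string $a, string $b, ?callable $fetchHeaders = null): bool
{
    return urlKey(resolveFinalUrl($a, $fetchHeaders)) === urlKey(resolveFinalUrl($b, $fetchHeaders));
}
```

Calling sameAfterRedirects('http://soccernet.com', 'http://soccernet.espn.go.com/index?cc=4716') would perform live requests. Whether ignoring the query string is safe depends on the site, since some sites serve different pages per query string.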
Upvotes: 0
Reputation: 27934
soccernet.com and soccernet.espn.go.com are completely different URLs. It's a very specific case: the program would have to make an HTTP request to soccernet.com to notice that it redirects to soccernet.espn.go.com. Is that viable in your case?
Upvotes: 0
Reputation: 83729
You could possibly get some level of certainty that they are the same by comparing the file sizes (the Content-Length header) after issuing a HEAD request to each URL, although that doesn't tell you exactly what you want.
If the file sizes match, you could then fetch the contents and compare them.
Here is some info on doing a HEAD request:
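As a rough sketch, a HEAD request can be issued in PHP by passing a stream context to get_headers(), and the Content-Length read from the result. The helper names headHeaders(), contentLength() and sameSize() are hypothetical, and the 5-second timeout is an arbitrary assumption. Note that servers are not required to send Content-Length, so a missing value is treated here as "unknown" rather than as a match.

```php
<?php
// Hypothetical helpers for comparing sizes via HEAD requests.

/** Issue a HEAD request and return the headers as an associative array (or false on failure). */
function headHeaders(string $url)
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD', 'timeout' => 5], // timeout value is an assumption
    ]);
    // The third (context) argument of get_headers() needs PHP 7.1 or later.
    return @get_headers($url, 1, $context);
}

/** Extract Content-Length from a header array, or null if absent or unusable. */
function contentLength($headers): ?int
{
    if (!is_array($headers)) {
        return null;
    }
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    // After redirects the header can appear once per response; take the last one.
    if (is_array($value)) {
        $value = end($value);
    }
    return is_numeric($value) ? (int) $value : null;
}

/** Two sizes only count as equal when both are actually known. */
function sameSize(?int $a, ?int $b): bool
{
    return $a !== null && $a === $b;
}
```

Usage would look like sameSize(contentLength(headHeaders($urlA)), contentLength(headHeaders($urlB))). Equal sizes are only a hint that the pages match, so a content comparison is still needed afterwards.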
Upvotes: 0
Reputation: 132524
The only way is to download each page and compare them.
Really, this shouldn't be too much trouble, since the average HTML file is fairly small (normally well under 100 KB). You don't need to download any of the referenced files.
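One possible way to compare the downloaded pages, sketched under assumptions: since ads and tracking scripts can make two fetches of the same page differ byte for byte, this strips script and style blocks and markup, then scores the remaining text with similar_text(). The names stripNoise() and pageSimilarity() are hypothetical, and any threshold for "same page" would have to be chosen by experiment.

```php
<?php
// Hypothetical helpers for a fuzzy comparison of two HTML documents.

/** Drop script/style blocks and tags, then collapse whitespace. */
function stripNoise(string $html): string
{
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html) ?? $html;
    $text = strip_tags($html);
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

/** Similarity of the visible text of two pages, as a percentage from 0 to 100. */
function pageSimilarity(string $htmlA, string $htmlB): float
{
    $a = stripNoise($htmlA);
    $b = stripNoise($htmlB);
    if ($a === '' && $b === '') {
        return 100.0; // nothing left to compare on either side
    }
    similar_text($a, $b, $percent);
    return (float) $percent;
}
```

The HTML itself could be fetched with file_get_contents($url), which requires allow_url_fopen to be enabled. similar_text() gets slow on long strings, so for large pages comparing hashes of the stripped text or only a prefix may be more practical.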
Upvotes: 1
Reputation: 161831
You cannot determine this in the general case. http://server1/page.aspx and http://server2/page.aspx could be the same page if server1 and server2 both map to the same IP address, or even just to the same server farm.
In fact, even if they were the same page, they could have completely different contents if the page renders differently based on the URL used to request it.
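For what it's worth, the IP-address part can be checked, though only as a weak hint. The sketch below uses gethostbynamel(), which resolves IPv4 addresses only. The helper names resolveIps() and sharesIp() are hypothetical. As the caveats above imply, a shared IP does not prove the pages are the same (virtual hosts), and distinct IPs do not prove they differ (server farms).

```php
<?php
// Hypothetical helpers: a shared IP address is a hint, not proof.

/** Resolve a host name to its IPv4 addresses; empty array if resolution fails. */
function resolveIps(string $host): array
{
    $ips = gethostbynamel($host); // returns false on failure
    return $ips === false ? [] : $ips;
}

/** True when the two address lists have at least one address in common. */
function sharesIp(array $ipsA, array $ipsB): bool
{
    return count(array_intersect($ipsA, $ipsB)) > 0;
}
```

A call such as sharesIp(resolveIps('server1'), resolveIps('server2')) would perform live DNS lookups.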
Upvotes: 0
Reputation: 9921
I really don't think this is possible, given your soccernet example, without actually comparing the output you get from each page.
Upvotes: 4