Kim Stacks

Reputation: 10832

Given two URLs, how can I tell whether they actually refer to the same website or webpage?

I am using PHP.

Given two URLs like http://soccernet.com and http://soccernet.espn.go.com/index?cc=4716,

how can I tell that they are actually the same?

Also consider the situation where the only difference is the scheme, as with https://gmail.com and http://gmail.com.

Please advise. I am struggling with regex because it is not very good at cases like the soccernet example, where the two URLs look nothing alike.

I am open to any good ideas and am not limiting myself to regex.

Edit: thanks for all the comments and answers below. Is there a good way to establish a level of certainty? What factors should I look for, and how do I go about it in the most efficient way?

Upvotes: 0

Views: 187

Answers (7)

fabrik

Reputation: 14375

Maybe cURL is your friend. It can follow redirects like the one from soccernet.com to soccernet.espn.go.com.
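A minimal sketch of that idea, assuming the PHP cURL extension is installed. The function name `resolveFinalUrl`, the redirect cap and the timeout are illustrative choices, not something from the question. It sends a HEAD request, lets cURL follow any redirects, and reports the URL it ended up on.

```php
<?php
// Hypothetical helper: return the URL cURL lands on after following redirects,
// or null if the request fails.
function resolveFinalUrl(string $url, int $maxRedirects = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,          // HEAD request: headers only
        CURLOPT_FOLLOWLOCATION => true,          // follow Location headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // guard against redirect loops
        CURLOPT_RETURNTRANSFER => true,          // do not echo anything
        CURLOPT_TIMEOUT        => 10,            // seconds; an arbitrary choice
    ]);

    // curl_exec() returns false on failure, a (possibly empty) string otherwise.
    if (curl_exec($ch) === false) {
        return null;
    }

    // CURLINFO_EFFECTIVE_URL is the last URL cURL actually requested.
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

Two URLs that resolve to the same final URL are probably the same page, for example `resolveFinalUrl('http://gmail.com') === resolveFinalUrl('https://gmail.com')`. Some servers answer HEAD differently from GET, so treat a mismatch as inconclusive rather than as proof.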

Upvotes: 0

ACoolie

Reputation: 1449

You can do an HTTP HEAD request to determine whether the page is being redirected somewhere else. You could compare the actual response bodies, but with a website like ESPN even the same URL will rarely respond with the same contents, due to tracking JavaScript and ads.

Use the get_headers() function and recursively follow the 'Location' key. So 'soccernet.com' redirects to 'http://soccernet.espn.go.com/archive/', which redirects to 'http://soccernet.espn.go.com/index'. Ignoring the query string, this URL and the other URL you have are equivalent.

print_r(get_headers('http://soccernet.espn.go.com/archive/', 1));
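Following the 'Location' key and then comparing while ignoring the scheme and query string could look roughly like this. It is a sketch under assumptions: `followRedirects`, `comparableUrl` and the `$fetchHeaders` parameter are hypothetical names introduced here, and only root-relative redirects (those starting with `/`) are resolved.

```php
<?php
// Follow 'Location' headers until a URL stops redirecting.
// $fetchHeaders is a hypothetical hook so the loop can be tried offline;
// by default it calls get_headers() in associative mode.
function followRedirects(string $url, int $maxHops = 10, ?callable $fetchHeaders = null): string
{
    $fetchHeaders = $fetchHeaders ?? function (string $u) {
        return @get_headers($u, true);
    };

    for ($hop = 0; $hop < $maxHops; $hop++) {
        $headers = $fetchHeaders($url);
        if (!is_array($headers)) {
            break; // request failed: keep the last URL we had
        }
        $headers = array_change_key_case($headers, CASE_LOWER);
        if (empty($headers['location'])) {
            break; // no redirect: this is the final URL
        }
        // get_headers() may already have followed several hops, in which case
        // 'location' is an array; the last entry is the most recent target.
        $location = $headers['location'];
        $next = is_array($location) ? end($location) : $location;

        if (!parse_url($next, PHP_URL_HOST)) {
            if ($next === '' || $next[0] !== '/') {
                break; // other relative forms are not handled in this sketch
            }
            $base = parse_url($url);
            $next = $base['scheme'] . '://' . $base['host'] . $next; // port is ignored
        }
        $url = $next;
    }
    return $url;
}

// Reduce a URL to lowercase host + path, dropping scheme, query and trailing slash.
function comparableUrl(string $url): string
{
    $parts = parse_url($url) ?: [];
    return strtolower($parts['host'] ?? '') . rtrim($parts['path'] ?? '', '/');
}
```

With these, `comparableUrl(followRedirects($a)) === comparableUrl(followRedirects($b))` is one possible equivalence test. Dropping the query string is a judgment call: on sites where the query selects the page (for example `?id=42`), it would wrongly merge different pages.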

Upvotes: 0

Havenard

Reputation: 27934

soccernet.com and soccernet.espn.go.com are completely different URLs. It's a very specific case, where the program would need to make an HTTP request to soccernet.com to notice that it redirects to soccernet.espn.go.com. Is that viable in your case?

Upvotes: 0

John Boker

Reputation: 83729

Possibly you could get some level of certainty that they are the same by comparing the file sizes reported after issuing a HEAD request for each URL, although that doesn't give you exactly what you want.

If the file sizes match after the HEAD requests, you could then fetch the contents and compare them.

Here is some info on doing a HEAD request:

http://www.eggheadcafe.com/tutorials/aspnet/2c13cafc-be1c-4dd8-9129-f82f59991517/the-lowly-http-head-reque.aspx
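In PHP, a HEAD request that reads the reported size might be sketched as below. This assumes PHP 7.1 or later (for the context argument of get_headers()); `headContentLength` and `contentLengthFrom` are hypothetical helper names, and the 10-second timeout is arbitrary.

```php
<?php
// Issue a HEAD request and return the reported Content-Length, or null if
// the request fails or the server does not send one.
function headContentLength(string $url): ?int
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD', 'timeout' => 10],
    ]);
    $headers = @get_headers($url, true, $context);
    return is_array($headers) ? contentLengthFrom($headers) : null;
}

// Pull Content-Length out of an associative header array.
function contentLengthFrom(array $headers): ?int
{
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    if (is_array($value)) {
        $value = end($value); // after redirects, the last response is the final page
    }
    $value = trim((string) $value);
    return ctype_digit($value) ? (int) $value : null;
}
```

Comparing `headContentLength($a)` with `headContentLength($b)` is only a cheap first filter: dynamic pages often omit Content-Length, and two different pages can happen to have the same size.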

Upvotes: 0

Matthew Scharley

Reputation: 132524

The only way is to download each page and compare them.

Really, this shouldn't be too much trouble, since the average HTML file is fairly small (normally well under 100 KB). You don't need to download all the referenced files.
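Since ads and tracking scripts make byte-for-byte comparison unreliable, one way to compare two downloaded pages is by the words they contain. The sketch below uses Jaccard similarity of word sets, which is a technique chosen here for illustration rather than anything from the question; `pageWords` and `pageSimilarity` are hypothetical names.

```php
<?php
// Reduce an HTML page to a set of lowercase words, dropping scripts, styles and tags.
function pageWords(string $html): array
{
    // If the regex fails (for example on very large input), fall back to the raw HTML.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
    $text = strtolower(html_entity_decode(strip_tags($html)));
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_unique($words);
}

// Jaccard similarity of the two word sets:
// 1.0 means identical vocabulary, 0.0 means no words in common.
function pageSimilarity(string $htmlA, string $htmlB): float
{
    $a = pageWords($htmlA);
    $b = pageWords($htmlB);
    if (!$a && !$b) {
        return 1.0; // two empty pages count as identical
    }
    $shared = count(array_intersect($a, $b));
    $total  = count(array_unique(array_merge($a, $b)));
    return $shared / $total;
}
```

A call such as `pageSimilarity(file_get_contents($urlA), file_get_contents($urlB))` yields a score between 0 and 1. Where to draw the line for "probably the same page" is an assumption you would have to tune against real pairs of URLs.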

Upvotes: 1

John Saunders

Reputation: 161831

You cannot determine this in the general case. http://server1/page.aspx and http://server2/page.aspx could be the same page if server1 and server2 both map to the same IP address, or even if they both map to the same server farm.

Conversely, even if they were the same page, they could have completely different contents if the page renders differently based on the URL used to request it.

Upvotes: 0

chrissr

Reputation: 9921

I really don't think this is possible, given your soccernet example, without actually comparing the output you get from each page.

Upvotes: 4
