Reputation: 2279
I have a scenario where I check whether a user-submitted URL is already present in the database. My concern is that a user can submit the same URL in different formats. For example, http://mysite.com/rahul/palake/?&test=1 and http://www.mysite.com/rahul/palake/?&test=1 should be considered one and the same. If I have already stored the URL as http://mysite.com/rahul/palake/?&test=1 in my database, then searching for http://www.mysite.com/rahul/palake/?&test=1 should give me a message that the URL already exists. I am using the following code for this. It works for me, but I want to make sure it covers all possible scenarios. Can this code be improved?
$url="http://dev.mysite.com/rahul/palake/?&test=1";
$parse_url=parse_url($url);
//default to the original url so that $url1 is always defined
$url1 = $url;
//first check if the host already starts with 'www.'
if (isset($parse_url['host']) && stripos($parse_url['host'], 'www.') !== 0)
{
    //assign default scheme as http if scheme is not defined
    $scheme = isset($parse_url['scheme']) ? trim($parse_url['scheme']) : 'http';
    //create new url with 'www.' embedded in it
    $url1 = str_replace($scheme . "://", $scheme . "://www.", $url);
    //e.g. for http://mysite.com/rahul/palake/?&test=1
    //$url1 is now http://www.mysite.com/rahul/palake/?&test=1
}
//so that $url && $url1 should be considered as one and the same
//i.e. mysite.com/rahul/palake/?&test=1 is equivalent to www.mysite.com/rahul/palake/?&test=1
//should also be equivalent to http://mysite.com/rahul/palake/?&test=1
//code to check url already exists in database goes here
//here I will be checking if table.url like $url or table.url like $url1
//if record found then return msg as url already exists
Upvotes: 1
Views: 893
Reputation: 491
You also have to consider that, in a load-balanced environment, www could be any number of subdomains, so www.mysite.com could be mysite.com or www2.mysite.com, etc.
I believe a URL by its very nature should be unique, and it's a perfectly valid scenario that the content may be very different between www.mysite.com and mysite.com.
If your objective with this code is to prevent content duplication then I have two suggestions for a better approach:
Automated: If you think you have a potentially matching URL that is not identical, you could use a curl-like command to retrieve the content of both URLs and hash it to determine whether they are identical (this could give you false negatives for many reasons).
Manual: Much like other content submission systems, you could present the user with a list of potential matches and ask them to verify that their content is indeed unique. If you went down this path, I would normalise the database to store each URL with a unique ID that you can then use to link it to the entity you are currently storing. This would allow you to have many entities referring to the one URL, if this is the desired behavior.
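The automated suggestion could be sketched roughly as follows. The function names (contentFingerprint, fetchBody, sameContent) are made up for illustration, and collapsing whitespace before hashing is an assumption of this sketch, not something the answer specifies. Pages with dynamic parts (timestamps, tokens, ads) will still hash differently, which is one source of the false negatives mentioned above.

```php
<?php
// Hypothetical helpers: the names and the whitespace handling are assumptions.

// Hash a response body after collapsing runs of whitespace, so that trivial
// formatting differences do not change the fingerprint.
function contentFingerprint(string $body): string
{
    $normalized = trim(preg_replace('/\s+/', ' ', $body));
    return hash('sha256', $normalized);
}

// Fetch a URL with the curl extension; returns null if the request fails.
function fetchBody(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects such as www -> non-www
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// True only when both URLs could be fetched and their fingerprints match.
function sameContent(string $urlA, string $urlB): bool
{
    $a = fetchBody($urlA);
    $b = fetchBody($urlB);
    return $a !== null && $b !== null
        && contentFingerprint($a) === contentFingerprint($b);
}
```

A call such as sameContent($url, $url1) would then only be made for candidate pairs that already look similar, since it costs two HTTP requests.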
Upvotes: 1
Reputation: 13557
What about www.example.org/?one=bar&two=foo and www.example.org/?two=foo&one=bar? They are the same URI (if normalized) but wouldn't match your regular string comparison. More examples of the same URI in different notations:
www.example.org/?one=bar&two=foo and www.example.org/?one=bar&&&&two=foo
www.example.org/#foo and www.example.org/#bar
www.example.org/hello/world.html and www.example.org/hello/mars/../world.html
www.example.org:80/ and www.example.org/
www.EXAMPLE.org and www.example.org/
www.example.org/%68%65%6c%6c%6f.html and www.example.org/hello.html
Long story short: you need to normalize the URLs before storing them in the database in order to be able to compare them later on.
I don't know of any PHP library that would do this for you. I've implemented this in JavaScript with URI.js - maybe you can use that to get started…
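In the meantime, here is a rough PHP sketch of such a normalization built on parse_url(). The function name normalizeUrl and the $stripWww flag are hypothetical. Several choices are assumptions of this sketch rather than anything stated in the thread: defaulting to http when no scheme is given, dropping the fragment, collapsing empty path segments, and sorting query parameters. It is not a complete RFC 3986 implementation.

```php
<?php
// Hypothetical helper: one possible normalization, not a complete RFC 3986 one.
function normalizeUrl(string $url, bool $stripWww = false): ?string
{
    // parse_url() only recognizes the host when a scheme is present; assume http.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    // Scheme and host are case-insensitive.
    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);
    if ($stripWww) {
        // Assumption: www.example.org and example.org serve the same content.
        $host = preg_replace('/^www\./', '', $host);
    }

    // Keep the port only when it differs from the scheme's default.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = '';
    if (isset($parts['port']) && ($defaultPorts[$scheme] ?? null) !== $parts['port']) {
        $port = ':' . $parts['port'];
    }

    // Decode percent-escapes of unreserved characters (%68 -> h); uppercase the rest.
    $path = preg_replace_callback(
        '/%([0-9a-f]{2})/i',
        function (array $m): string {
            $char = chr(hexdec($m[1]));
            return preg_match('/^[A-Za-z0-9._~-]$/', $char) ? $char : strtoupper($m[0]);
        },
        $parts['path'] ?? '/'
    );

    // Resolve "." and ".." segments (empty segments are collapsed as well).
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    // Preserve a trailing slash, e.g. /rahul/palake/ stays /rahul/palake/.
    $trailing = preg_match('~/\.{0,2}$~', $path) && $segments ? '/' : '';
    $path = '/' . implode('/', $segments) . $trailing;

    // Drop empty parameters (&&&&) and sort the rest so their order does not matter.
    $query = '';
    if (isset($parts['query'])) {
        $params = array_filter(explode('&', $parts['query']), function ($p) {
            return $p !== '';
        });
        sort($params, SORT_STRING);
        if ($params) {
            $query = '?' . implode('&', $params);
        }
    }

    // The fragment (#foo) is never sent to the server, so it is left out.
    return $scheme . '://' . $host . $port . $path . $query;
}
```

The idea would be to store normalizeUrl($url) in the database (or in an extra column next to the original) and compare submitted URLs against that value with a plain equality check. Passing true for $stripWww makes http://mysite.com/rahul/palake/?&test=1 and http://www.mysite.com/rahul/palake/?&test=1 normalize to the same string, which only makes sense if both hosts really serve the same content.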
Upvotes: 2