Reputation: 1398
I am building a website where people can submit their blog addresses. When someone submits a blog, I want to check whether it is already in the database.
The problem is that somebody can write the same URL as "http://blog.com" or "http://www.blog.com".
What would be the best way to check whether a URL is a duplicate?
My idea is to check whether the URL has "http://" and "www", and then compare the part after "www", but I feel this would be slow because I have more than 3000 URLs. Thanks!
Upvotes: 0
Views: 130
Reputation: 95101
Disclaimer: this is for experimental purposes; it is only meant to guide you toward the format you want to use.
I think you should save only the domain and subdomain. I will demonstrate what I mean with this simple script.
Imagine an array:
$urls = array(
    'http://blog.com',
    'http://somethingelse.blog.com',
    'http://something1.blog.com',
    'ftp://blog.com',
    'https://blog.com',
    'http://www.blog.com',
    'http://www.blog.net',
    'blog.com',
    'somethingelse.blog.com',
);
If you run:
$found = array();
$blogUrl = new BlogURL();
foreach ( $urls as $url ) {
    $domain = $blogUrl->parse($url);
    if (! $domain) {
        $blogUrl->log("#Parse can't parse $url");
        continue;
    }
    $key = array_search($domain, $found);
    if ($key !== false) {
        $blogUrl->log("#Duplicate $url same as {$found[$key]}");
        continue;
    }
    $found[] = $domain;
    $blogUrl->log("#new $url has $domain");
}
var_dump($found);
Output
array
0 => string 'blog.com' (length=8)
1 => string 'somethingelse.blog.com' (length=22)
2 => string 'something1.blog.com' (length=19)
3 => string 'blog.net' (length=8)
If you want to see the inner workings:
echo "<pre>";
echo implode(PHP_EOL, $blogUrl->getOutput());
Output
blog.com Found in http://blog.com
#new http://blog.com has blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#new http://somethingelse.blog.com has somethingelse.blog.com
something1.blog.com Found in http://something1.blog.com
#new http://something1.blog.com has something1.blog.com
#error domain not found in ftp://blog.com
#Parse can't parse ftp://blog.com
blog.com Found in https://blog.com
#Duplicate https://blog.com same as blog.com
www.blog.com Found in http://www.blog.com
#Duplicate http://www.blog.com same as blog.com
www.blog.net Found in http://www.blog.net
#new http://www.blog.net has blog.net
#Fixed blog.com to
#Fixed http://blog.com to http://blog.com
blog.com Found in http://blog.com
#Duplicate blog.com same as blog.com
#Fixed somethingelse.blog.com to
#Fixed http://somethingelse.blog.com to http://somethingelse.blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#Duplicate somethingelse.blog.com same as somethingelse.blog.com
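Class used:
class BlogURL {
    // Log lines collected while parsing; initialised so getOutput() always returns an array.
    private $output = array();

    public function parse($url) {
        // Prepend a scheme when none is given, so that validation below can succeed.
        if (! preg_match("~^(?:f|ht)tps?://~i", $url)) {
            $this->log("#Fixed $url to ");
            $url = "http://" . $url;
            $this->log("#Fixed $url to $url");
        }
        if (! filter_var($url, FILTER_VALIDATE_URL)) {
            $this->log("#Error $url not valid");
            return false;
        }
        // Only http(s) URLs are accepted here, so ftp:// URLs are rejected below.
        // Capture the host only (everything up to the first "/").
        preg_match('!https?://([^/\s]+)!i', $url, $matches);
        $domain = isset($matches[1]) ? $matches[1] : null;
        if (! $domain) {
            $this->log("#error domain not found in $url");
            return false;
        }
        $this->log($domain . " Found in $url");
        // Strip a leading "www." only. ltrim($domain, "w.") would also eat
        // the first letter of hosts such as "web.blog.com".
        return preg_replace('~^www\.~i', '', $domain);
    }

    public function log($var = PHP_EOL) {
        $this->output[] = $var;
    }

    public function getOutput() {
        return $this->output;
    }
}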
Upvotes: 0
Reputation: 522016
www.blog.com and blog.com may or may not be two entirely different blogs. For example, example.blogspot.com and blogspot.com are two entirely different sites. www. is just a normal subdomain like any other and there's no rule on how it should behave. The same goes for the path following the domain; example.com/blorg and example.com/foobarg may be two independent blogs.
Therefore, you want to make an HTTP request to the given URL and see if it redirects somewhere. Typically there is one canonical URL, and www.blog.com redirects to blog.com or the other way around. So dig into the curl extension or any other favorite HTTP request module to make a request to the given URL and figure out which canonical URL it resolves to.
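A minimal sketch of that redirect check with PHP's curl extension might look like the following. The function name resolveCanonicalUrl() and the timeout and redirect limits are assumptions made for illustration. It sends a HEAD request, which some servers reject, so a real implementation may need to fall back to GET.

```php
<?php
// Hypothetical helper: follow redirects and return the URL the request ends on,
// or null if the request fails. Requires the curl extension.
function resolveCanonicalUrl($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // HEAD request: only the headers are needed
        CURLOPT_FOLLOWLOCATION => true,  // follow 3xx redirects
        CURLOPT_MAXREDIRS      => 5,     // assumed limit, guards against redirect loops
        CURLOPT_RETURNTRANSFER => true,  // do not echo the response
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ));
    $ok = curl_exec($ch);
    if ($ok === false) {
        return null;
    }
    // The last URL curl actually requested after following redirects.
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

Two submissions can then be treated as the same blog when resolveCanonicalUrl() returns the same value for both. Because this performs a network request per submission, it is better done once at submit time, storing the result, than on every lookup.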
You may also want to parse the entire URL using parse_url and take only, for instance, the hostname and path together as the unique identifier, ignoring other irregularities like the scheme or query parameters.
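As a rough illustration of the parse_url approach, the sketch below builds a "host + path" key. The name blogKey() is hypothetical, and the choices to lowercase the host, drop a trailing slash, and keep "www." as part of the host are assumptions made for this example.

```php
<?php
// Hypothetical helper: reduce a URL to "host + path", ignoring scheme and query.
function blogKey($url)
{
    // parse_url() only reports a host when a scheme is present, so add one if missing.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }
    $host = strtolower($parts['host']);  // host names are case-insensitive
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    return $host . $path;
}
```

With this, 'http://example.com/blorg' and 'https://example.com/blorg/?ref=feed' produce the same key, while 'example.com/foobarg' produces a different one. Storing the key in its own column with a unique index lets the database do the duplicate check.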
Upvotes: 1
Reputation: 5109
I would create a Url object that implements a comparison interface (C#), so you can do it like this:
var url = new Url("http://www.someblog.nl");
var url2 = new Url("http://someblog.nl");
// Assumes Url overloads ==; otherwise this compares references, not values.
if (url == url2)
{
    throw new UrlNeedsToBeUniqueException();
}
You can implement the comparison with a regex, or simply strip the www. part from the URL with a string replace before comparing.
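The same idea, sketched in PHP since that is the language used elsewhere in this thread: a small value object that normalizes on construction and compares the normalized form. The class name ComparableUrl and its equals() method are hypothetical. Lowercasing the whole string, including the path, is a simplification, and stripping "www." assumes both variants really point to the same blog.

```php
<?php
// Hypothetical value object: two instances are equal when their normalized forms match.
final class ComparableUrl
{
    private $normalized;

    public function __construct($url)
    {
        $url = strtolower(trim($url));                  // simplification: also lowercases the path
        $url = preg_replace('~^https?://~', '', $url);  // drop the scheme
        $url = preg_replace('~^www\.~', '', $url);      // drop a leading "www." only
        $this->normalized = rtrim($url, '/');
    }

    public function equals(ComparableUrl $other)
    {
        return $this->normalized === $other->normalized;
    }
}

$url  = new ComparableUrl('http://www.someblog.nl');
$url2 = new ComparableUrl('http://someblog.nl');
if ($url->equals($url2)) {
    echo "duplicate\n";
}
```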
Upvotes: 0