Andromeda

Reputation: 12897

URL splitting in PHP

I have a URL like this:

http://www.w3schools.com/PHP/func_string_str_split.asp

I want to split that URL to get only the host part. For that I am using

parse_url($url,PHP_URL_HOST);

It returns www.w3schools.com, but I want only 'w3schools.com'. Is there a function for that, or do I have to do it manually?

Upvotes: 0

Views: 482

Answers (3)

Paul Dixon

Reputation: 300825

There are many ways you could do this. A simple replace is the fastest if you know you always want to strip off 'www.'

$stripped=str_replace('www.', '', $domain);

A regex replace lets you bind that match to the start of the string:

$stripped=preg_replace('/^www\./', '', $domain);
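To see why anchoring matters, here is a small sketch comparing the two replacements on a host that contains 'www.' more than once. The function names and the sample host are made up for illustration; only str_replace() and preg_replace() come from the answer.

```php
<?php
// Hypothetical helper names, used only for this comparison.

// Removes every occurrence of "www.", wherever it appears.
function strip_www_anywhere(string $domain): string
{
    return str_replace('www.', '', $domain);
}

// Removes "www." only when it is the leading label.
function strip_www_prefix(string $domain): string
{
    return preg_replace('/^www\./', '', $domain);
}

$host = 'www.awww.example.com'; // made-up host with "www." inside a later label

echo strip_www_anywhere($host), "\n"; // aexample.com
echo strip_www_prefix($host), "\n";   // awww.example.com
```

The plain replace damages the second label, while the anchored regex only touches the prefix.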

If it's always the first part of the domain, regardless of whether it's 'www', you could use explode/implode. Though it's easy to read, it's the least efficient method:

$parts=explode('.', $domain);
array_shift($parts); //eat first element
$stripped=implode('.', $parts);

A regex achieves the same goal more efficiently:

$stripped=preg_replace('/^\w+\./', '', $domain);

Now you might imagine that the following would be more efficient than the above regex:

$period=strpos($domain, '.');
if ($period!==false)
{
    $stripped=substr($domain,$period+1);
}
else
{
    $stripped=$domain; //there was no period
}

But I benchmarked it and found that over a million iterations, the preg_replace version consistently beat it. Typical results, normalized to the fastest (so it has a unitless time of 1):

  • Simple str_replace: 1
  • preg_replace with /^\w+\./: 1.494
  • strpos/substr: 1.982
  • explode/implode: 2.472
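The benchmark script itself isn't shown here. A harness along these lines should reproduce the comparison; the names, the sample host and the iteration count are assumptions, it needs PHP 7.4+ for the arrow functions, and the ratios will vary with PHP version and hardware.

```php
<?php
// Sketch of a micro-benchmark; all names here are hypothetical.

$candidates = [
    'str_replace' => fn(string $d): string => str_replace('www.', '', $d),
    'preg_replace' => fn(string $d): string => preg_replace('/^\w+\./', '', $d),
    'strpos/substr' => function (string $d): string {
        $period = strpos($d, '.');
        return $period !== false ? substr($d, $period + 1) : $d;
    },
    'explode/implode' => function (string $d): string {
        $parts = explode('.', $d);
        array_shift($parts); // drop the first label
        return implode('.', $parts);
    },
];

// Runs $fn $iterations times and returns the elapsed time in seconds.
function time_candidate(callable $fn, string $domain, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn($domain);
    }
    return (hrtime(true) - $start) / 1e9;
}

$timings = [];
foreach ($candidates as $name => $fn) {
    $timings[$name] = time_candidate($fn, 'www.example.com', 1000000);
}

// Normalize to the fastest candidate, as in the list above.
$fastest = min($timings);
foreach ($timings as $name => $seconds) {
    printf("%-16s %.3f\n", $name, $seconds / $fastest);
}
```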

The above code samples always strip the first domain component, so will work just fine on domains like "www.example.com" and "www.example.co.uk" but not "example.com" or "www.department.example.com". If you need to handle domains that may already be the main domain, or have multiple subdomains (such as "foo.bar.baz.example.com") and want to reduce them to just the main domain ("example.com"), try the following. The first sample in each approach returns only the last two domain components, so won't work with "co.uk"-like domains.

  • explode:

    $parts = explode('.', $domain);
    $parts = array_slice($parts, -2);
    $stripped = implode('.', $parts);
    

    Since explode is consistently the slowest approach, there's little point in writing a version that handles "co.uk".

  • regex:

    $stripped=preg_replace('/^.*?([^.]+\.[^.]*)$/', '$1', $domain);
    

    This captures the final two parts from the domain and replaces the full string value with the captured part. With multiple subdomains, all the leading parts get stripped.

    To work with ".co.uk"-like domains as well as a variable number of subdomains, try:

    $stripped=preg_replace('/^.*?([^.]+\.(?:[^.]*|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
    
  • str:

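    $last = strrpos($domain, '.');
    // Search backwards from just before the last period. The guard avoids an
    // out-of-range offset when the period is missing or is the first character.
    $period = ($last > 0) ? strrpos($domain, '.', $last - strlen($domain) - 1) : false;
    if ($period !== false) {
        $stripped = substr($domain,$period+1);
    } else {
        $stripped = $domain;
    }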
    

    Allowing for co.uk domains:

    $len = strlen($domain);
    if ($len < 7) {
        $stripped = $domain;
    } else {
        if ($domain[$len-3] === '.' && $domain[$len-6] === '.') {
            $offset = -7;
        } else {
            $offset = -5;
        }
        $period = strrpos($domain, '.', $offset);
        if ($period !== FALSE) {
            $stripped = substr($domain,$period+1);
        } else {
            $stripped = $domain;
        }
    }
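For a quick check of the behaviour described above, the ".co.uk"-aware regex can be wrapped in a function and run over a few hosts. The name main_domain() and the sample hosts are made up; the pattern is the one from the regex bullet. Note that it treats any pair of two-character labels at the end as a "co.uk"-style suffix, so it is a heuristic and not a real suffix lookup.

```php
<?php
// Hypothetical wrapper around the ".co.uk"-aware pattern from the answer.
function main_domain(string $domain): string
{
    return preg_replace('/^.*?([^.]+\.(?:[^.]*|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
}

$hosts = ['www.example.com', 'foo.bar.baz.example.com', 'www.example.co.uk', 'example.com'];
foreach ($hosts as $host) {
    echo $host, ' => ', main_domain($host), "\n";
}
// www.example.com => example.com
// foo.bar.baz.example.com => example.com
// www.example.co.uk => example.co.uk
// example.com => example.com
```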
    

The regex and str-based implementations can be made ever-so-slightly faster by sacrificing edge cases (where the primary domain component is very short, e.g. "a.com"; the regex below requires at least three characters there):

  • regex:

    $stripped=preg_replace('/^.*?([^.]{3,}\.(?:[^.]+|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
    
  • str:

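    // An offset that reaches past the start of the string is an error in PHP 8,
    // so only search when the domain has at least 7 characters.
    $period = (strlen($domain) >= 7) ? strrpos($domain, '.', -7) : false;
    if ($period !== false) {
        $stripped = substr($domain,$period+1);
    } else {
        $stripped = $domain;
    }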
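The trade-off is easiest to see by running the faster regex on a short name. In this sketch (fast_main_domain() is a made-up name) the {3,} quantifier means a main label shorter than three characters never matches, and preg_replace() then returns the input unchanged.

```php
<?php
// Hypothetical wrapper around the faster pattern from the answer.
function fast_main_domain(string $domain): string
{
    return preg_replace('/^.*?([^.]{3,}\.(?:[^.]+|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
}

echo fast_main_domain('www.example.co.uk'), "\n";   // example.co.uk
echo fast_main_domain('foo.bar.example.com'), "\n"; // example.com
echo fast_main_domain('www.a.com'), "\n";           // www.a.com (no match, nothing stripped)
```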
    

Though the behavior is changed, the rankings aren't (most of the time). Here they are, with times normalized to the quickest.

  • multiple subdomain regex: 1
  • .co.uk regex (fast): 1.01
  • .co.uk str (fast): 1.056
  • .co.uk regex (correct): 1.1
  • .co.uk str (correct): 1.127
  • multiple subdomain str: 1.282
  • multiple subdomain explode: 1.305

Here, the differences between times are so small that it wasn't unusual for the rankings to change between runs. The fast .co.uk regex, for example, often beat the basic multiple subdomain regex. Thus, the exact implementation shouldn't have a noticeable impact on speed. Instead, pick one based on simplicity and clarity. As long as you don't need to handle .co.uk domains, that would be the multiple subdomain regex approach.

Upvotes: 6

Rutesh Makhijani

Reputation: 17225

You need to strip off any characters before the first occurrence of the [.] character (along with the [.] itself) if and only if there is more than one occurrence of [.] in the returned string.

For example, if the returned string is www-139.in.ibm.com, then the regular expression should return in.ibm.com, since that would be the domain.

If the returned string is music.domain.com, then the regular expression should return domain.com.

In rare cases you can access the site without the server prefix, i.e. using http://domain.com/pageurl. In that case you would get the domain directly as domain.com, and the regex should not strip anything.

IMO this should be the pseudo-logic of the regex. If you want, I can write a regex for you that covers these cases.
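One way to turn that pseudo-logic into a pattern is a lookahead that only lets the first label be removed when another dot follows it. This is just a sketch under that reading of the rules, and strip_first_label() is a made-up name.

```php
<?php
// Hypothetical helper: drop the first label only if at least two dots are present.
function strip_first_label(string $host): string
{
    // ^[^.]+\.   the first label and its dot
    // (?=.*\.)   lookahead: another dot must appear later in the string
    return preg_replace('/^[^.]+\.(?=.*\.)/', '', $host);
}

echo strip_first_label('www-139.in.ibm.com'), "\n"; // in.ibm.com
echo strip_first_label('music.domain.com'), "\n";   // domain.com
echo strip_first_label('domain.com'), "\n";         // domain.com
```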

Upvotes: 0

Stefan Gehrig

Reputation: 83622

You have to strip off the subdomain part by yourself - there is no built-in function for this.

// $domain being www.w3schools.com
$domain = implode('.', array_slice(explode('.', $domain), -2));

The above example also works for subdomains of unlimited depth, as it'll always return the last two domain parts (domain and top-level domain).

If you only want to strip off www. you can simply do a str_replace(), which will be faster indeed:

$domain = str_replace('www.', '', $domain);
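Tying this back to the question, the parse_url() call and the array_slice() approach can be combined into one helper. This is a minimal sketch: last_two_labels_from_url() is a made-up name, and like the slice itself it gives the wrong result for suffixes such as "co.uk".

```php
<?php
// Hypothetical helper combining parse_url() from the question with the
// array_slice() approach from this answer.
function last_two_labels_from_url(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // malformed URL or no host component
    }
    // Keep only the last two dot-separated labels of the host.
    return implode('.', array_slice(explode('.', $host), -2));
}

echo last_two_labels_from_url('http://www.w3schools.com/PHP/func_string_str_split.asp'), "\n";
// w3schools.com
```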

Upvotes: 0
