sjobe
sjobe

Reputation: 2837

Regular expression to retrieve domain.tld

I'm need a regular expression in Java that I can use to retrieve the domain.tld part from any url. So https://foo.com/bar, http://www.foo.com#bar, http://bar.foo.com will all return foo.com.

I wrote this regex, but it's matching the whole url

Pattern.compile("[.]?.*[.x][a-z]{2,3}");

I'm not sure I'm matching the "." character right. I tried "." but I get an error from netbeans.

Update:

The tld is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.

Upvotes: 4

Views: 23869

Answers (8)

jsamsa
jsamsa

Reputation: 969

This is harder than you might imagine. Your example https://foo.com/bar, has a comma in it, which is a valid URL character. Here is a great post about some of the troubles:

https://blog.codinghorror.com/the-problem-with-urls/

https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])

Is a good starting point

Some listings from "Mastering Regular Expressions" on this topic:

http://regex.info/listing.cgi?ed=3&p=207

@sjobe

>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)

sorry the example is in python not java, it's more brief. Java requires some extraneous escaping of the regex.

Upvotes: 10

Yeongjun Kim
Yeongjun Kim

Reputation: 749

Code:

public class DomainUrlUtils {
    private static String[] TLD = {"com", "net"}; // top-level domain
    private static String[] SLD = {"co\\.kr"}; // second-level domain

    public static String getDomainName(String url) {
        Pattern pattern = Pattern.compile("(?<=)[^(\\.|\\/)]\\w+\\.(" + joinTldAndSld("|") + ")$");
        Matcher match = pattern.matcher(url);
        String domain = null;

        if (match.find()) {
            domain = match.group();
        }

        return domain;
    }

    private static String joinTldAndSld(String delimiter) {
        String t = String.join(delimiter, TLD);
        String s = String.join(delimiter, SLD);

        return new StringBuilder(t).append(s.isEmpty() ? "" : "|" + s).toString();
    }
}

Test:

public class DomainUrlUtilsTest {

    @Test
    public void getDomainName() throws Exception {
        // given
        String[][] domainUrls = {
            {
                "test.com",
                "sub1.test.com",
                "sub1.sub2.test.com",
                "https://sub1.test.com",
                "http://sub1.sub2.test.com"
            },
            {
                "https://domain.com",
                "https://sub.domain.com"
            },
            {
                "http://domain.co.kr",
                "http://sub.domain.co.kr",
                "http://local.sub.domain.co.kr",
                "http://local-test.sub.domain.co.kr",
                "sub.domain.co.kr",
                "domain.co.kr",
                "test.sub.domain.co.kr"
            }
        };

        String[] expectedUrls = {
            "test.com",
            "domain.com",
            "domain.co.kr"
        };

        // when
        // then
        for (int domainIndex = 0; domainIndex < domainUrls.length; domainIndex++) {
            for (String url : domainUrls[domainIndex]) {
                String convertedUrl = DomainUrlUtils.getDomainName(url);

                if (expectedUrls[domainIndex].equals(convertedUrl)) {
                    System.out.println(url + " -> " + convertedUrl);
                } else {
                    Assert.fail("origin Url: " + url + " / converted Url: " + convertedUrl);
                }
            }
        }
    }
}

Results:

test.com -> test.com
sub1.test.com -> test.com
sub1.sub2.test.com -> test.com
https://sub1.test.com -> test.com
http://sub1.sub2.test.com -> test.com
https://domain.com -> domain.com
https://sub.domain.com -> domain.com
http://domain.co.kr -> domain.co.kr
http://sub.domain.co.kr -> domain.co.kr
http://local.sub.domain.co.kr -> domain.co.kr
http://local-test.sub.domain.co.kr -> domain.co.kr
sub.domain.co.kr -> domain.co.kr

Upvotes: 0

tomisyourname
tomisyourname

Reputation: 91

This works for me:

public static String getDomain(String url){
    if(TextUtils.isEmpty(url)) return null;
    String domain = null;
    if(url.startsWith("http://")) {
        url = url.replace("http://", "").trim();
    } else if(url.startsWith("https://")) {
        url = url.replace("https://", "").trim();
    }
    String[] temp = url.split("/");
    if(temp != null && temp.length > 0) {
        domain = temp[0];
    }  
    return domain;
}

Upvotes: 0

mel
mel

Reputation: 1606

    /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/

Almost there, but won't match when second-level domain has 3 characters like this: www.foo.com Test it here.

Upvotes: 0

Amy B
Amy B

Reputation: 17977

new URL(url).getHost()

No regex needed.

Upvotes: 4

Qtax
Qtax

Reputation: 33908

If the string contains a valid URL then you could use a regex like (Perl quoting):

/^
(?:\w+:\/\/)?
[^:?#\/\s]*?

(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)

(?:[:?#\/]|$)
/xi;

Results:

url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk

For Java it would be quoted something like:

"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"

Of course you'll need to replace the etc part.

Example Perl script:

use strict;

my @test = qw(
    https://foo.com/bar
    http://www.foo.com#bar
    http://bar.foo.com
    ftp://foo.com
    ftp://www.foo.co.uk?bar
    ftp://www.foo.co.uk:8080/bar
);

for(@test){
    print "url: $_\n";

    /^
    (?:\w+:\/\/)?
    [^:?#\/\s]*?

    (
    [^.\s]+
    \.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
    )

    (?:[:?#\/]|$)
    /xi;

    print "matched: $1\n";
}

Upvotes: 6

idrosid
idrosid

Reputation: 8029

I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two parts of the host uri.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunIt {

    public static void main(String[] args) throws URISyntaxException {
        Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");

        String[] urls = new String[] {
                "https://foo.com/bar",
                "http://www.foo.com#bar",
                "http://bar.foo.com"
        };

        for (String url:urls) {
            URI uri = new URI(url);
            //eg: uri.getHost() will return "www.foo.com"
            Matcher m = p.matcher(uri.getHost());
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    }
}

Prints:

foo.com
foo.com
foo.com

Upvotes: 8

Adam Pope
Adam Pope

Reputation: 3274

You're going to need to get a list of all possible TLDs and ccTLDs and then match against them. You have to do this else you'll never be able to distinguish between subdomain.dom.com and hello.co.uk.

So, get your self such a list. I recommend inverting it so you store, for example, uk.co. Then, you can extract the domain from a URL by getting everying between // and / or end of line. Split at . and work backwards, matching the TLD and then 1 additional level to get the domain.

Upvotes: 3

Related Questions