Reputation: 2837
I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com should all return foo.com.
I wrote this regex, but it's matching the whole URL:
Pattern.compile("[.]?.*[.x][a-z]{2,3}");
I'm not sure I'm matching the "." character right. I tried "\." but I get an error from NetBeans.
Update:
The TLD is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.
Upvotes: 4
Views: 23869
Reputation: 969
This is harder than you might imagine. Your example https://foo.com/bar, has a comma in it, which is a valid URL character. Here is a great post about some of the troubles:
https://blog.codinghorror.com/the-problem-with-urls/
https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])
is a good starting point.
Some listings from "Mastering Regular Expressions" on this topic:
http://regex.info/listing.cgi?ed=3&p=207
@sjobe
>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)
Sorry, the example is in Python rather than Java; it's more brief. Java requires some extra escaping of the regex.
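For reference, a rough Java version of the session above might look like the sketch below. The class and method names (`UrlHostSketch`, `firstGroup`) are made up for illustration. This particular pattern happens to contain no backslashes, so it carries over unchanged; any regex backslash (such as `\.`) would need to be doubled in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlHostSketch {
    // Same pattern as in the Python session above.
    private static final Pattern URL = Pattern.compile(
            "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    /** Returns capture group 1 if the input starts with a match, otherwise null. */
    public static String firstGroup(String input) {
        Matcher m = URL.matcher(input);
        // lookingAt() anchors at the start of the input, like Python's re.match()
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup("http://news.google.com/"));
        System.out.println(firstGroup("http://google.com"));
        System.out.println(firstGroup("not a url"));
    }
}
```

Returning null for a non-match avoids the equivalent of the AttributeError shown in the Python session. Note that this still captures the host plus any trailing path, not the domain.tld part alone.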
Upvotes: 10
Reputation: 749
Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DomainUrlUtils {
private static String[] TLD = {"com", "net"}; // top-level domain
private static String[] SLD = {"co\\.kr"}; // second-level domain
public static String getDomainName(String url) {
Pattern pattern = Pattern.compile("(?<=)[^(\\.|\\/)]\\w+\\.(" + joinTldAndSld("|") + ")$");
Matcher match = pattern.matcher(url);
String domain = null;
if (match.find()) {
domain = match.group();
}
return domain;
}
private static String joinTldAndSld(String delimiter) {
String t = String.join(delimiter, TLD);
String s = String.join(delimiter, SLD);
return new StringBuilder(t).append(s.isEmpty() ? "" : "|" + s).toString();
}
}
Test:
import org.junit.Assert;
import org.junit.Test;
public class DomainUrlUtilsTest {
@Test
public void getDomainName() throws Exception {
// given
String[][] domainUrls = {
{
"test.com",
"sub1.test.com",
"sub1.sub2.test.com",
"https://sub1.test.com",
"http://sub1.sub2.test.com"
},
{
"https://domain.com",
"https://sub.domain.com"
},
{
"http://domain.co.kr",
"http://sub.domain.co.kr",
"http://local.sub.domain.co.kr",
"http://local-test.sub.domain.co.kr",
"sub.domain.co.kr",
"domain.co.kr",
"test.sub.domain.co.kr"
}
};
String[] expectedUrls = {
"test.com",
"domain.com",
"domain.co.kr"
};
// when
// then
for (int domainIndex = 0; domainIndex < domainUrls.length; domainIndex++) {
for (String url : domainUrls[domainIndex]) {
String convertedUrl = DomainUrlUtils.getDomainName(url);
if (expectedUrls[domainIndex].equals(convertedUrl)) {
System.out.println(url + " -> " + convertedUrl);
} else {
Assert.fail("origin Url: " + url + " / converted Url: " + convertedUrl);
}
}
}
}
}
Results:
test.com -> test.com
sub1.test.com -> test.com
sub1.sub2.test.com -> test.com
https://sub1.test.com -> test.com
http://sub1.sub2.test.com -> test.com
https://domain.com -> domain.com
https://sub.domain.com -> domain.com
http://domain.co.kr -> domain.co.kr
http://sub.domain.co.kr -> domain.co.kr
http://local.sub.domain.co.kr -> domain.co.kr
http://local-test.sub.domain.co.kr -> domain.co.kr
sub.domain.co.kr -> domain.co.kr
Upvotes: 0
Reputation: 91
This works for me:
public static String getDomain(String url){
if(TextUtils.isEmpty(url)) return null;
String domain = null;
if(url.startsWith("http://")) {
url = url.replace("http://", "").trim();
} else if(url.startsWith("https://")) {
url = url.replace("https://", "").trim();
}
String[] temp = url.split("/");
if(temp != null && temp.length > 0) {
domain = temp[0];
}
return domain;
}
Upvotes: 0
Reputation: 1606
/[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/
Almost there, but it gives the wrong result when the second-level domain has 3 characters, as in www.foo.com: the whole host matches instead of foo.com.
Upvotes: 0
Reputation: 33908
If the string contains a valid URL then you could use a regex like (Perl quoting):
/^
(?:\w+:\/\/)?
[^:?#\/\s]*?
(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)
(?:[:?#\/]|$)
/xi;
Results:
url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk
For Java it would be quoted something like:
"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"
Of course you'll need to replace the ___etc___ placeholder with the remaining suffixes you want to support.
Example Perl script:
use strict;
my @test = qw(
https://foo.com/bar
http://www.foo.com#bar
http://bar.foo.com
ftp://foo.com
ftp://www.foo.co.uk?bar
ftp://www.foo.co.uk:8080/bar
);
for(@test){
print "url: $_\n";
/^
(?:\w+:\/\/)?
[^:?#\/\s]*?
(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)
(?:[:?#\/]|$)
/xi;
print "matched: $1\n";
}
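A minimal sketch of how the Java-quoted pattern above could be used. The class and method names are hypothetical, and the ___etc___ placeholder is simply dropped here, so only the five multi-part suffixes listed in the answer are recognised. The Perl `/i` modifier is mapped to `Pattern.CASE_INSENSITIVE`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegisteredDomainRegex {
    // Assumption: only these multi-part suffixes are handled; a real list is much longer.
    private static final String MULTI_PART_SUFFIXES =
            "co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au";

    private static final Pattern DOMAIN = Pattern.compile(
            "^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|"
                    + MULTI_PART_SUFFIXES
                    + "))(?:[:?#/]|$)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the captured domain, or null if the pattern does not match. */
    public static String extract(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://foo.com/bar",
            "http://www.foo.com#bar",
            "ftp://www.foo.co.uk:8080/bar"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + extract(url));
        }
    }
}
```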
Upvotes: 6
Reputation: 8029
I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two parts of the host name.
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RunIt {
public static void main(String[] args) throws URISyntaxException {
Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");
String[] urls = new String[] {
"https://foo.com/bar",
"http://www.foo.com#bar",
"http://bar.foo.com"
};
for (String url:urls) {
URI uri = new URI(url);
//eg: uri.getHost() will return "www.foo.com"
Matcher m = p.matcher(uri.getHost());
if (m.matches()) {
System.out.println(m.group(1));
}
}
}
}
Prints:
foo.com
foo.com
foo.com
Upvotes: 8
Reputation: 3274
You're going to need to get a list of all possible TLDs and ccTLDs and then match against them. You have to do this, or else you'll never be able to distinguish between subdomain.dom.com and hello.co.uk.
So, get yourself such a list. I recommend inverting it so you store, for example, uk.co. Then you can extract the domain from a URL by getting everything between // and the next / or the end of the line. Split at . and work backwards, matching the TLD and then 1 additional level to get the domain.
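The steps above could be sketched roughly as follows. Everything here is illustrative: the class name, the method name, and especially the tiny hard-coded suffix set, which stands in for a real list. For readability the suffixes are stored in normal order (co.uk) rather than inverted, and the candidate suffix is built up label by label from the end of the host.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SuffixListSketch {
    // Hypothetical, deliberately tiny suffix list; a real one has thousands of entries.
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "net", "uk", "co.uk", "org.uk"));

    /** Returns suffix plus one extra label, or null if no known suffix applies. */
    public static String registeredDomain(String url) {
        // Take everything after "//" up to the next '/', '?', '#' or ':' (or the end).
        // Limitation: a "user:pass@" prefix is not handled by this sketch.
        int start = url.indexOf("//");
        String host = start >= 0 ? url.substring(start + 2) : url;
        host = host.split("[/?#:]", 2)[0].toLowerCase(Locale.ROOT);

        String[] labels = host.split("\\.");
        String suffix = "";
        int suffixLabels = 0;
        // Work backwards, growing the candidate suffix one label at a time
        // and remembering the longest one found in the list.
        for (int i = labels.length - 1; i >= 0; i--) {
            suffix = suffix.isEmpty() ? labels[i] : labels[i] + "." + suffix;
            if (SUFFIXES.contains(suffix)) {
                suffixLabels = labels.length - i;
            }
        }
        // A known suffix is required, plus one more label in front of it.
        if (suffixLabels == 0 || suffixLabels >= labels.length) {
            return null;
        }
        String[] tail = Arrays.copyOfRange(
                labels, labels.length - suffixLabels - 1, labels.length);
        return String.join(".", tail);
    }

    public static void main(String[] args) {
        System.out.println(registeredDomain("http://www.foo.co.uk/bar"));
        System.out.println(registeredDomain("http://bar.foo.com"));
    }
}
```

Keeping the longest matching suffix is what lets hello.co.uk and subdomain.dom.com be told apart: the first matches co.uk and keeps three labels, the second matches com and keeps two.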
Upvotes: 3