Reputation: 2337
I have a list of URLs and need to filter them by a specific domain, including its subdomains. Say I have some URLs like
http://www.example.com
http://test.example.com
http://test2.example.com
I need to extract the URLs that belong to the domain example.com.
Upvotes: 1
Views: 4837
Reputation: 632
I was working on a project that required me to determine whether one of two URLs is a subdomain of the other (even when the subdomains are nested). I worked up a modification of the guide above. It has held up pretty well so far:
import java.net.MalformedURLException;
import java.net.URL;

public static boolean isOneSubdomainOfTheOther(String a, String b) {
    try {
        // Extract both hosts and drop a leading "www."
        URL first = new URL(a);
        String firstHost = first.getHost();
        firstHost = firstHost.startsWith("www.") ? firstHost.substring(4) : firstHost;
        URL second = new URL(b);
        String secondHost = second.getHost();
        secondHost = secondHost.startsWith("www.") ? secondHost.substring(4) : secondHost;
        /*
         Test if one is a substring of the other
         */
        if (firstHost.contains(secondHost) || secondHost.contains(firstHost)) {
            String[] firstPieces = firstHost.split("\\.");
            String[] secondPieces = secondHost.split("\\.");
            String[] longerHost;
            String[] shorterHost;
            if (firstPieces.length >= secondPieces.length) {
                longerHost = firstPieces;
                shorterHost = secondPieces;
            } else {
                longerHost = secondPieces;
                shorterHost = firstPieces;
            }
            int minLength = shorterHost.length;
            int i = 1;
            /*
             Compare from the tail of both hosts and work backwards
             */
            while (minLength > 0) {
                String tail1 = longerHost[longerHost.length - i];
                String tail2 = shorterHost[shorterHost.length - i];
                if (tail1.equalsIgnoreCase(tail2)) {
                    // move one place to the left
                    minLength--;
                } else {
                    // domains do not match
                    return false;
                }
                i++;
            }
            // shorter host exhausted, so one host is a subdomain of the other
            return true;
        }
    } catch (MalformedURLException ex) {
        ex.printStackTrace();
    }
    return false;
}
Figured I'd leave it here for future reference, in case someone has a similar problem.
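For the case in the question, where the target domain is already known, a plain suffix check on the host may be enough. This is only a minimal sketch and not the method above: the class name HostSuffixCheck and the helper isInDomain are made up for illustration, and it assumes the inputs are absolute URLs that java.net.URI can parse.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class HostSuffixCheck {
    // Hypothetical helper: true if the URL's host is `domain` itself
    // or ends with "." + domain (i.e. is a subdomain of it).
    public static boolean isInDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // no host component, e.g. a relative URL
            }
            host = host.toLowerCase(Locale.ROOT);
            String target = domain.toLowerCase(Locale.ROOT);
            // The leading dot keeps "notexample.com" from matching "example.com"
            return host.equals(target) || host.endsWith("." + target);
        } catch (URISyntaxException e) {
            return false; // treat unparsable input as "not in the domain"
        }
    }

    public static void main(String[] args) {
        System.out.println(isInDomain("http://test.example.com", "example.com"));
        System.out.println(isInDomain("http://notexample.com", "example.com"));
    }
}
```

The dot in front of the domain is the important part: a bare endsWith("example.com") would also accept hosts such as notexample.com.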
Upvotes: 3
Reputation: 527
I understand you are probably looking for a fancy solution using the URL class or something similar, but that is not required. Simply think of a way to extract "example.com" from each of the URLs.
Note: example.com is a different domain from, say, example.net. Extracting just "example" is therefore technically the wrong thing to do.
Take a sample URL, say:
http://sub.example.com/page1.html
Step 1: Split the URL on the delimiter "/" to extract the part containing the domain.
Each such part can be viewed as the following blocks (any of which may be empty):
[www][subdomain][basedomain]
Step 2: Discard "www" (if present). We are left with [subdomain][basedomain].
Step 3: Split the remaining string on the delimiter ".".
Step 4: Count the strings produced by the split. If there are 2 strings, together they are the target domain (example and com). If there are 3 or more, look at the last 3 strings. If the last string has length 3, the last 2 strings make up the domain (example and com). If the last string has length 2, the last 3 strings make up the domain (example and co and uk).
I think this should do the trick (I do hope this wasn't a homework :D )
// You may clean up this method to make it more efficient / better
private String getRootDomain(String url) {
    // Element 2 of the "/" split is the host, e.g. "sub.example.com"
    String[] domainKeys = url.split("/")[2].split("\\.");
    int length = domainKeys.length;
    // Do not count a leading "www" label
    int dummy = domainKeys[0].equals("www") ? 1 : 0;
    if (length - dummy == 2) {
        return domainKeys[length - 2] + "." + domainKeys[length - 1];
    } else {
        // String length is a method call: length(), not the array field length
        if (domainKeys[length - 1].length() == 2) {
            return domainKeys[length - 3] + "." + domainKeys[length - 2] + "." + domainKeys[length - 1];
        } else {
            return domainKeys[length - 2] + "." + domainKeys[length - 1];
        }
    }
}
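To tie this back to the question, here is a rough sketch of filtering a list of URLs with the heuristic above. The class name RootDomainFilter and the helper filterByRootDomain are invented for this example. It carries a condensed copy of getRootDomain (with length() as a method call so that it compiles) and assumes every input is an absolute URL of the form scheme://host/... whose host has at least two labels.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RootDomainFilter {
    // Condensed copy of the heuristic above; same decisions, fewer branches.
    public static String getRootDomain(String url) {
        String[] domainKeys = url.split("/")[2].split("\\.");
        int length = domainKeys.length;
        int dummy = domainKeys[0].equals("www") ? 1 : 0;
        // Two labels (ignoring "www"), or a last label that is not 2 characters long:
        // the last two labels form the domain.
        if (length - dummy == 2 || domainKeys[length - 1].length() != 2) {
            return domainKeys[length - 2] + "." + domainKeys[length - 1];
        }
        // Two-letter last label: keep three labels, e.g. example.co.uk
        return domainKeys[length - 3] + "." + domainKeys[length - 2] + "." + domainKeys[length - 1];
    }

    // Hypothetical helper: keep only the URLs whose root domain equals rootDomain.
    public static List<String> filterByRootDomain(List<String> urls, String rootDomain) {
        return urls.stream()
                .filter(u -> getRootDomain(u).equalsIgnoreCase(rootDomain))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.example.com",
                "http://test.example.com",
                "http://test2.example.com",
                "http://www.example.net");
        // Keeps the three example.com URLs and drops example.net
        System.out.println(filterByRootDomain(urls, "example.com"));
    }
}
```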
Upvotes: 2