Kerry Jones
Kerry Jones

Reputation: 21848

Regular Expressions - Parsing Domain Issues

I am trying to find the domain -- everything but the subdomain.

I have this regexp right now:

(?:[-a-zA-Z0-9]+\.)*([-a-zA-Z0-9]+(?:\.[a-zA-Z]{2,3})){1,2}

This works for things like:

domain.tld
subdomain.tld

But it runs into trouble with tld's like ".com.au" or ".co.uk":

domain.co.uk (finds co.uk, should find domain.co.uk)
subdomain.domain.co.uk (finds co.uk, should find domain.co.uk)

Any ideas?

Upvotes: 2

Views: 168

Answers (1)

sarnold
sarnold

Reputation: 104080

I'm not sure this problem is "reasonably solvable"; Mozilla maintains a list of 'public suffix' domains that is intended to help browser authors accept cookies for only domains within one administrative control (e.g., prevent someone from setting a cookie valid for *.co.uk. or *.union.aero.). It obviously isn't perfect (near the end, you'll find a long list of is-a-caterer.com-style domains, so foo.is-a-caterer.com couldn't set a cookie that would be used by bar.is-a-caterer.com, but is-a-caterer.com is perfectly well a "domain" as you've defined it.)

So, if you're prepared to use the list as provided, you could write a quick little parser that would know how to apply the general rules and exceptions to determine where in the given input string your "domain" comes, and return just the portion you're interested in.

I think simpler approaches are doomed to failure: some ccTLDs such as .ca don't use second-level domains, some such as .br use dozens, and some, like lib.or.us are several levels away from the "domain" such as multnomah.lib.or.us. Unless you're using curated lists of which domains are a public suffix, you're doomed to being wrong for some non-trivial set of input strings.

Upvotes: 3

Related Questions