Reputation: 2145
I have a list of URLs (all including http://), where some are bare domain names and others include a full path.
How could I programmatically extract the extension (.com, .net, ...) using shell scripting, taking into consideration that some extensions are multi-part, such as .co.uk?
Upvotes: 0
Views: 647
Reputation: 16907
Essentially you'd need a list of everything you're considering a "TLD"; there are a finite number of these. Then for each URL, check whether anything in your list matches it, and if so, print the match. The reason you need to construct the list yourself is that .co.uk is not a TLD: .uk is the TLD and .co is a subdomain of it. A sketch of this approach is below.
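For example, a minimal sketch in Python (the suffix list here is a hand-picked stand-in for illustration; a real script would seed it from something like the Public Suffix List):

from urllib.parse import urlparse

# Hand-picked suffix list (an assumption for illustration), sorted
# longest-first so that '.co.uk' is tried before '.uk'.
SUFFIXES = sorted(['.com', '.net', '.ca', '.biz', '.co.uk', '.uk'],
                  key=len, reverse=True)

def extension(url):
    # netloc is the hostname part, e.g. www.mydomain.co.uk
    host = urlparse(url).netloc
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return suffix
    return None  # host uses a suffix not in our list

print(extension('http://www.mydomain.co.uk/path/to/file.html'))  # prints: .co.uk
print(extension('http://example.com'))                           # prints: .com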
Or you could construct an enormously long regex (for example, extracting .co.uk, .com, .ca, .biz):
$ perl -ne 'next unless m{^http://[^ /?#]+(\.com|\.co\.uk|\.ca|\.biz)(?=[/?#:]|$)}; print $1, "\n"'
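For instance, given a file (call it urls.txt, a hypothetical name) containing http://www.mydomain.co.uk/path/to/file.html and http://example.com, one URL per line, the one-liner prints:

.co.uk
.com

The lookahead anchors the match at the end of the hostname, and anything whose extension is not in the alternation is silently skipped, so the list of extensions has to be maintained by hand.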
Upvotes: 2
Reputation: 40733
The most robust way is to use a library to parse the URL. For example, in Python:
from urllib.parse import urlparse  # the module was named urlparse in Python 2

# netloc is the network location, e.g. www.mydomain.co.uk
domain = urlparse('http://www.mydomain.co.uk/path/to/file.html').netloc
# the last dot-separated label is the TLD proper (.uk, not .co.uk)
tld = domain.split('.')[-1]
print(tld)  # prints: uk
This prints just the last label of the network location, which is the TLD proper (uk in this example, or what I think you meant by "extension").
UPDATE: the code now prints the TLD, instead of the whole domain.
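If you actually want the full public suffix (.co.uk rather than .uk), one option is the third-party tldextract library, which matches hostnames against the Public Suffix List. A minimal sketch, assuming tldextract is installed:

import tldextract

# extract() returns a named tuple with subdomain, domain and suffix fields
ext = tldextract.extract('http://www.mydomain.co.uk/path/to/file.html')
print(ext.suffix)  # prints: co.uk
print(ext.domain)  # prints: mydomain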
Upvotes: 2