Reputation: 3523
I have a string like google.com in Python, which I would like to split into two parts: google and .com. The problem arises when I have a URL such as subdomain.google.com, which I would like to split into subdomain.google and .com.
How do I separate the rest of the URL from the TLD? The split can't simply be made at the last . in the URL, because of TLDs such as .co.uk. Note that the URL does not contain http:// or www.
Upvotes: 4
Views: 485
Reputation: 36
I tried tld and urllib, but did not find them satisfactory. I came across this question several times while searching Google for how to parse a URL. After a while, I took the time to write a regex and turn it into an open source package.
It handles URLs that have a second-level top domain like co.uk, and also supports national URLs with special characters.
url-parser on PyPI
URL Parser on GitHub
In your case, you could use it like this:
Step one:
pip install url-parser
Step two:
from url_parser import parse_url
url = parse_url('subdomain.google.com')
url['sub_domain'] # subdomain
url['domain'] # google
url['top_domain'] # com
You can use these keys to get the different parts of the URL:
protocol
www
sub_domain
domain
top_domain
dir
file
fragment
query
Upvotes: 1
Reputation: 7939
To do this, you will need a list of valid domain suffixes. The top-level ones (.com, .org, etc.) and the country codes (.us, .fr, etc.) are easy to find. Try http://www.icann.org/en/resources/registries/tlds.
For the second-level ones (.co.uk, .org.au) you might need to look up each country code to see which second-level domains it uses. Wikipedia is your friend.
Once you have the list, grab the last two parts from the name you have (google.com or co.uk) and see if it is in your second level list. If not, grab the last part and see if it is in your top level list.
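A minimal sketch of that lookup in Python, in case it helps. The two sets below are tiny made-up samples rather than real lists, and the function name split_tld is just an illustrative choice; you would fill the sets from whatever lists you end up collecting.

```python
# Tiny sample lists for illustration only; real lists are far longer.
TOP_LEVEL = {"com", "org", "uk", "au", "us", "fr"}
SECOND_LEVEL = {"co.uk", "org.au"}


def split_tld(hostname):
    """Return (rest, tld), where tld keeps its leading dot."""
    labels = hostname.lower().split(".")
    # Try the last two labels first (e.g. "co.uk"), then the last one.
    if len(labels) > 2 and ".".join(labels[-2:]) in SECOND_LEVEL:
        cut = 2
    elif len(labels) > 1 and labels[-1] in TOP_LEVEL:
        cut = 1
    else:
        raise ValueError("no known TLD in %r" % hostname)
    rest = ".".join(labels[:-cut])
    tld = "." + ".".join(labels[-cut:])
    return rest, tld


print(split_tld("subdomain.google.com"))
print(split_tld("google.co.uk"))
```

Note that this sketch lowercases the input before matching, so the returned parts are lowercase too.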
Upvotes: 0
Reputation: 212835
tldextract looks like what you need. It deals with the .co.uk issue.
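A rough sketch of how that could look for the split asked about in the question, assuming tldextract is installed (pip install tldextract). The helper names join_rest_and_tld and split_hostname are made up for this example; tldextract.extract and its subdomain, domain and suffix fields come from the library itself.

```python
def join_rest_and_tld(subdomain, domain, suffix):
    """Combine extracted pieces into (rest, tld) with a leading dot on tld."""
    rest = ".".join(part for part in (subdomain, domain) if part)
    tld = "." + suffix if suffix else ""
    return rest, tld


def split_hostname(hostname):
    # Third-party import kept local so the helper above works without it.
    import tldextract

    ext = tldextract.extract(hostname)
    return join_rest_and_tld(ext.subdomain, ext.domain, ext.suffix)


if __name__ == "__main__":
    print(split_hostname("subdomain.google.co.uk"))
```

By default tldextract may try to download the Public Suffix List on first use, so the first call can touch the network.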
Upvotes: 6