Reputation: 3523

Python URL splitting

I have a string like google.com in Python, which I would like to split into two parts: google and .com. The problem arises when I have a URL such as subdomain.google.com, which I would like to split into subdomain.google and .com.

How do I separate the rest of the URL from the TLD? Splitting on the last . in the URL won't work because of TLDs such as .co.uk. Note that the URL does not contain http:// or www.

Upvotes: 4

Answers (3)

odd86

Reputation: 36

I tried tld and urllib, but found neither satisfying. I kept landing on this question when searching Google for how to parse a URL. After a while, I took the time to write a regex and turn it into an open-source package.

It handles URLs with a second-level TLD such as co.uk, and also supports national URLs containing special characters.

url-parser on PyPI
URL Parser on GitHub

In your case, you could use it like this:

Step one:

pip install url-parser

Step two:

from url_parser import parse_url


url = parse_url('subdomain.google.com')
url['sub_domain']  # subdomain
url['domain']      # google
url['top_domain']  # com

You can use these keys to get the different parts of the URL:

protocol
www
sub_domain
domain
top_domain
dir
file
fragment
query

Upvotes: 1

Michael J

Reputation: 7949

To do this, you will need a list of valid domain suffixes. The top-level ones (.com, .org, etc.) and the country codes (.us, .fr, etc.) are easy to find. Try http://www.icann.org/en/resources/registries/tlds.

For the second-level ones (.co.uk, .org.au), you may need to look up each country code to see which second-level domains it uses. Wikipedia is your friend.

Once you have the lists, take the last two parts of the name you have (google.com or co.uk) and check whether they are in your second-level list. If not, take the last part and check whether it is in your top-level list.
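
The lookup described above could be sketched roughly as follows. This is only a minimal illustration: the two suffix sets are tiny hand-picked samples (an assumption, not real data), and split_tld is a hypothetical helper name. In practice you would load complete lists from a maintained source.

```python
# Tiny sample lists for illustration only (assumption: real lists are far larger).
SECOND_LEVEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "org.au"}
TOP_LEVEL_SUFFIXES = {"com", "org", "net", "uk", "au", "us", "fr"}


def split_tld(hostname):
    """Split a bare hostname into (rest, suffix); suffix keeps its leading dot."""
    labels = hostname.lower().strip(".").split(".")

    # First try the last two labels against the second-level list.
    if len(labels) > 2 and ".".join(labels[-2:]) in SECOND_LEVEL_SUFFIXES:
        return ".".join(labels[:-2]), "." + ".".join(labels[-2:])

    # Otherwise fall back to the last label and the top-level list.
    if len(labels) > 1 and labels[-1] in TOP_LEVEL_SUFFIXES:
        return ".".join(labels[:-1]), "." + labels[-1]

    raise ValueError("no known suffix in %r" % hostname)


print(split_tld("subdomain.google.com"))
print(split_tld("google.co.uk"))
```

Checking the two-label suffix before the one-label suffix is what keeps google.co.uk from being split as google.co and .uk.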

Upvotes: 0

eumiro

Reputation: 213125

tldextract looks like what you need. It deals with the .co.uk issue.

Upvotes: 6
