q3d

Reputation: 3523

Python URL splitting

I have a string like google.com in Python, which I would like to split into two parts: google and .com. The problem comes with a URL such as subdomain.google.com, which I would like to split into subdomain.google and .com.

How do I separate the rest of the URL from the TLD? I can't simply split on the last . in the URL, because of TLDs such as .co.uk. Note that the URL does not contain http:// or www.

Upvotes: 4

Views: 485

Answers (3)

odd86

Reputation: 36

I tried tld and urllib, but did not find them satisfying. This question came up several times in my Google searches on how to parse a URL, so after a while I took the time to write a regex and turn it into an open source package.

It handles URLs that have a second-level top domain such as co.uk, and it also supports national URLs with special characters.

url-parser on PyPI
URL Parser on GitHub

For you, it would be easy to use it like this:

Step one:

pip install url-parser

Step two:

from url_parser import parse_url


url = parse_url('subdomain.google.com')
url['sub_domain']  # subdomain
url['domain']  # google
url['top_domain']  # com

You can use these keys to get the different parts of the URL:

  • protocol
  • www
  • sub_domain
  • domain
  • top_domain
  • dir
  • file
  • fragment
  • query

Upvotes: 1

Michael J

Reputation: 7939

To do this, you will need a list of valid domain suffixes. The top-level ones (.com, .org, etc.) and the country codes (.us, .fr, etc.) are easy to find. Try http://www.icann.org/en/resources/registries/tlds.

For the second-level ones (.co.uk, .org.au) you might need to look up each country code to see its subdomains. Wikipedia is your friend.

Once you have the lists, take the last two parts of the name you have (for example google.com or co.uk) and see whether they are in your second-level list. If not, take the last part alone and see whether it is in your top-level list.
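A minimal sketch of that lookup, assuming two small hand-written sets stand in for the real lists. The names SECOND_LEVEL, TOP_LEVEL and split_tld are made up for illustration, and the sample entries are nowhere near complete:

```python
# Tiny illustrative samples only (assumption); real lists are far longer.
SECOND_LEVEL = {"co.uk", "org.uk", "com.au", "org.au"}
TOP_LEVEL = {"com", "org", "net", "uk", "au", "us", "fr"}


def split_tld(host):
    """Split a bare host name into (rest, suffix) using the two lists."""
    labels = host.lower().split(".")
    # Check the last two parts against the second-level list first.
    if len(labels) > 2 and ".".join(labels[-2:]) in SECOND_LEVEL:
        n = 2
    # Otherwise fall back to the last part and the top-level list.
    elif len(labels) > 1 and labels[-1] in TOP_LEVEL:
        n = 1
    else:
        raise ValueError("no known suffix in %r" % host)
    return ".".join(labels[:-n]), "." + ".".join(labels[-n:])


print(split_tld("subdomain.google.com"))
print(split_tld("bbc.co.uk"))
```

The second-level check has to come first, otherwise bbc.co.uk would match .uk and be split as bbc.co and .uk.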

Upvotes: 0

eumiro

Reputation: 212835

tldextract looks like what you need. It deals with the .co.uk issue.

Upvotes: 6
