cs95

Reputation: 402473

Get protocol and domain (WITHOUT subdomain) from a URL

This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.

So, for example,

Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu

Input: mail.google.com
Output: google.com

Input: google.co.uk
Output: google.co.uk

For more context, I accept one or more seed URLs from a user and then run a scrapy crawler on the links. I need the domain name (without the subdomain) to set the spider's allowed_domains attribute.

I've also taken a look at Python urlparse -- extract domain name without subdomain but the answers there seem outdated.

My current code uses urlparse but this also gets the subdomain which I don't want...

from urllib.parse import urlparse

uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'

Is there a (hopefully stdlib) way of getting (only) the domain in python-3.x?

Upvotes: 3

Views: 959

Answers (1)

BENY

Reputation: 323226

I use tldextract (a third-party package, not part of the stdlib) when I need to parse domains.

In your case you only need to combine the domain and the suffix:

import tldextract

tldextract.extract('mail.google.com')
# ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
# ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
# ExtractResult(subdomain='', domain='google', suffix='co.uk')
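
To get from those results to the protocol plus bare domain the question asks for, something like the following might work. This is only a sketch: `Parts`, `registered_domain`, `base_url` and the `default_scheme` parameter are names I made up for illustration, and `Parts` is a stand-in namedtuple with the same three fields as the results above so that the snippet runs without tldextract installed.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Hypothetical stand-in with the same three fields as the results shown above,
# so the sketch runs without tldextract installed.
Parts = namedtuple("Parts", ["subdomain", "domain", "suffix"])


def registered_domain(parts):
    """Join domain and suffix, e.g. 'google' + 'co.uk' -> 'google.co.uk'."""
    # Skip empty pieces, e.g. a bare hostname with no recognised suffix.
    return ".".join(p for p in (parts.domain, parts.suffix) if p)


def base_url(url, parts, default_scheme="https"):
    """Return '<scheme>://<domain>.<suffix>/' for the given URL."""
    # urlparse only reports a scheme when the URL actually contains one,
    # so fall back to an assumed default for inputs like 'mail.google.com'.
    scheme = urlparse(url).scheme or default_scheme
    return f"{scheme}://{registered_domain(parts)}/"


# Field values copied from the extraction results above.
usc = Parts("classes", "usc", "edu")
print(registered_domain(usc))
print(base_url("https://classes.usc.edu/term-20191/classes/csci/", usc))
```

With tldextract installed, passing `tldextract.extract(url)` instead of a hand-built `Parts` should work the same way, since the helpers only read the `domain` and `suffix` attributes.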

Upvotes: 4
