Jeugasce
Jeugasce

Reputation: 197

Using REGEXP_EXTRACT to get domain and subdomains

I have only managed to extract the TLD of the list of websites that I have using

REGEXP_EXTRACT(Domain_name, r'(\.[^.:]*)]\.?:?[0-9]*$') AS web_tld

Example:

I have

www.example1.abc.com
www.example2.efg.123.net

I want the result

Subdomain

example1
efg

Domain

abc
123

TLD

.com
.net

EDIT: Encountered an error in my query 'Exactly one capturing group must be specified' when I use (.?([^.:]+).([^.:]+).([^.:]+):?[0-9]*$) as regex

SELECT
REGEXP_EXTRACT(Domain, r'(\.?([^.:]+)\.([^.:]+)\.([^.:]+):?[0-9]*$)'),
FROM [weblist.domain]
ORDER BY 1
LIMIT 250;

Upvotes: 8

Views: 11693

Answers (2)

Jordan Tigani
Jordan Tigani

Reputation: 26617

Note you may be better off using the HOST, DOMAIN, and TLD rather than custom regular expressions.

Upvotes: 6

Szymon
Szymon

Reputation: 43023

As you can only use one capturing group, I think you can actually use 3 separate regular expressions to get the values you want:

SELECT
REGEXP_EXTRACT(Domain, r'([^.:]+):?[0-9]*$'),
REGEXP_EXTRACT(Domain, r'([^.:]+).[^.:]+:?[0-9]*$'),
REGEXP_EXTRACT(Domain, r'([^.:]+).[^.:]+.[^.:]+:?[0-9]*$')
FROM [weblist.domain]
ORDER BY 1
LIMIT 250;

Upvotes: 9

Related Questions