Reputation: 311
I have this query to extract domain from URLs...
SELECT SUBSTRING(LEFT(url, LOCATE('/', url, 8) - 1), 8) AS domain...
It works only when the URL is something like www.google.com/something
. It doesn't work with URLs like www.google.it
(without trailing slash) or www.google.it/abc/xzy/
(permalink).
Do you know how to fix this query to make it more robust to different types of URLs?
Upvotes: 31
Views: 39660
Reputation: 132
SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(website, '/', 3), '://', -1), '/', 1), '?', 1),'www.',-1)
removes http:// https:// www.
and combinations of them.
Leaves domian.com and sub.domian.com combinations.
Upvotes: 0
Reputation: 6597
I tried multiple examples on this page (and some documentation) docs to make the following, annotated, version. It appears impossible with SUBSTRING_INDEX to remove subdomains without removing *.co.uk type domains.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(url,
'?', 1), # split on url params to remove weirdest stuff first
'://', -1), # remove protocal http:// https:// ftp:// ...
'/', 1), # split on path
':', 2), # split on user:pass
'@', 1), # split on user:port@
':', 1), # split on port
'www.', -1), # remove www.
'.', 4), # keep TLD + domain name
'/', 1)
AS domain
FROM (
SELECT 'http://test.com' as url
UNION SELECT 'https://test.com'
UNION SELECT 'http://test.com/one'
UNION SELECT 'http://test.com/?huh'
UNION SELECT 'http://www.test1.test.com?http://ouch.foo'
UNION SELECT 'test.com'
UNION SELECT 'test.com/one'
UNION SELECT 'test.com/one/two'
UNION SELECT 'test.com/one/two/three'
UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
UNION SELECT 'http://one.test.com'
UNION SELECT 'one.test.com/one'
UNION SELECT 'https://www.bbc.co.uk/'
UNION SELECT 'http://a.very.complex-domain.co.uk:8080/foo/bar'
UNION SELECT 'postgres://user:[email protected]:5432/path?k=v#f'
UNION SELECT 'http://10.64.3.5/data_check/index.php?r=index/rawdatacheck'
UNION SELECT 'two.one.test.com/one' ) AS test;
Upvotes: 1
Reputation: 83
For my case this worked fine:
SELECT SUBSTRING_INDEX(REPLACE(SUBSTRING_INDEX(url,'//',-1),'www.',''),'/',1) AS DOMAIN;
Upvotes: 2
Reputation: 8555
If you're not afraid of installing MySQL extensions (UDFs), then there's a UDF you can use that does exactly this while respecting different top-level domains like "google.com" and "google.co.uk"
https://github.com/StirlingMarketingGroup/mysql-get-etld-p1
select`get_etld_p1`('http://a.very.complex-domain.co.uk:8080/foo/bar');-- 'complex-domain.co.uk'
select`get_etld_p1`('https://www.bbc.co.uk/');-- 'bbc.co.uk'
select`get_etld_p1`('https://github.com/StirlingMarketingGroup/');-- 'github.com'
select`get_etld_p1`('https://localhost:10000/index');-- 'localhost'
select`get_etld_p1`('android-app://com.google.android.gm');-- 'com.google.android.gm'
select`get_etld_p1`('example.test.domain.com');-- 'domain.com'
select`get_etld_p1`('postgres://user:[email protected]:5432/path?k=v#f');-- 'host.com'
select`get_etld_p1`('exzvk.omsk.so-ups.ru');-- 'so-ups.ru'
select`get_etld_p1`('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck');-- '10.64.3.5'
select`get_etld_p1`('not a domain');-- null
Upvotes: 1
Reputation: 55
select SUBSTRING_INDEX(SUBSTRING_INDEX(URL, '://', -1),'/',1) as DOMAIN
Upvotes: 1
Reputation: 876
If you want to remove www. along with http://, https:// and /(path) from your domain please do this:
SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(document_url, '/', 3), '://', -1), '/', 1), '?', 1),'www.',-1)
Upvotes: 7
Reputation: 2065
I had to combine some of the previous answers , plus a little more hackery for my data set . This is what works for me , it returns the domain and any sub-domains:
SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
Explanation ( cause non-trivial SQL rarely makes sense ):
SUBSTRING_INDEX(target_url, '/', 3)
- strips any path if the url has a protocol
SUBSTRING_INDEX(THAT, '://', -1)
- strips any protocol from THAT
SUBSTRING_INDEX(THAT, '/', 1)
- strips any path from THAT ( if there was no protocol )
SUBSTRING_INDEX(THAT, '?', 1)
- strips the query string from THAT ( if there was no path or trailing / )
Test Cases:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
FROM (
SELECT 'http://test.com' as target_url
UNION SELECT 'https://test.com'
UNION SELECT 'http://test.com/one'
UNION SELECT 'http://test.com/?huh'
UNION SELECT 'http://test.com?http://ouch.foo'
UNION SELECT 'test.com'
UNION SELECT 'test.com/one'
UNION SELECT 'test.com/one/two'
UNION SELECT 'test.com/one/two/three'
UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
UNION SELECT 'http://one.test.com'
UNION SELECT 'one.test.com/one'
UNION SELECT 'two.one.test.com/one' ) AS Test;
Results:
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'test.com'
'one.test.com'
'one.test.com'
'two.one.test.com'
Upvotes: 66
Reputation: 1147
All the answers do not seem to work for me. For example, a url that starts variables with a ? fails for some answers. This works for me for all kinds of URLs:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(REPLACE(LOWER(url), 'https://', ''), 'http://', ''), '/', 1), '?', 1) AS domain;
Upvotes: 4
Reputation: 136
Best to use it as it will also capture url like "www.google.co.in"
SELECT replace((replace((SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', ''), '/', 1), '*', -2)), 'http:','')),'https:','') AS Domain
Upvotes: 0
Reputation: 614
This one works well on my very messy data:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', '.'), '/', 1), '.', -2) AS Domain
Upvotes: 1
Reputation: 1
The query needs to account for no trailing '/'
IF( LOCATE('/', replace(url,'http://', '') ) > 0 , SUBSTRING(replace(url,'http://', ''), 1, LOCATE('/', replace(url,'http://', '') ) - 1), replace(url,'http://', '')) AS domain
Upvotes: 0
Reputation: 207
remove www., anysubdomain and everything after /:
SUBSTRING_INDEX((SUBSTRING_INDEX((SUBSTRING_INDEX(url, 'http://', -1)), '/', 1)), '.', -2) as domain
Upvotes: 19
Reputation: 36512
If you want to find the first occurrence of /
why are you passing 8 into LOCATE
as the starting position? (I assume to skip slashes after the protocol such as http:// but if you also expect url's without a protocol, consider short ones like cnn.com/page)
IF(LOCATE('/', url) > 0, SUBSTRING(url, 1, LOCATE('/', url) - 1), url) AS domain
Alternatively:
SUBSTRING_INDEX(url, '/', 1) AS domain
The latter seems easier to me.
Upvotes: 2