Reputation: 1992
I have a column which has only ".edu .gov .org .com" domains stored in it. There is another column in the same table which stores CompanyID. I want to know the best procedure to categorize the company based on their domain name. i.e.: Educational for .edu, government for .gov etc.
My table looks like this:
CompaniID Website
--------------------------------------------
1 www.comp1.gov/index.php
2 www.great-nice.comp.edu/#contact
I tried the following approach to handle the TLD but it fails for some cases. Once I figure the TLD comparison, I can handle the rest of the query.
Snippet:
CASE
WHEN
REGEXP_Instr(Website, '.com') > 0
AND
NOT REGEXP_LIKE (SUBSTR (Website, REGEXP_Instr(Website, '.com') + 4,1), '^[a-z]|^[0-9]', 'i')
THEN 'For Profit'
ELSE 'Others'
END
Upvotes: 2
Views: 1230
Reputation: 10525
You can USE \.[a-z]+(/|$)
pattern to extract characters before the '/' characters.
WITH x (y)
AS (SELECT 'www.comp1.gov/index.php' FROM DUAL UNION ALL
SELECT 'www.great-nice.comp.edu/#contact' FROM DUAL UNION ALL
SELECT 'www.comp1.edu' FROM DUAL
)
SELECT y,
RTRIM (REGEXP_SUBSTR (y, '\.[a-z]+(/|$)'), '/') tld,
REGEXP_SUBSTR (y, '\.([a-z]+)(/|$)', 1, 1, NULL, 1) tld2 --works only in 11gR2 and higher versions
FROM x;
| y | tld | tld2 |
|-----------------------------------|-----|------|
| www.comp1.gov/index.php |.gov | gov |
| www.great-nice.comp.edu/#contact |.edu | edu |
| www.comp1.edu |.edu | edu |
Use this in CASE statement to categorise as per your needs.
Upvotes: 2
Reputation: 9344
That would match things like www.comed.com
. You probably want
.com(?:[/]|$)
Upvotes: 0