Reputation: 627
I have two tables, one containing list of URL and other having a list of words. My requirement is to filter out the URLs containing the words. For eg:
URL
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://www.t3.com/news/cheap-oled-tv-deals-currys-august
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
https://www.sony.co.in/electronics/televisions/a1-series
https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW
https://www.91mobiles.com/list-of-tvs/sony-oled-tv
Words
Sony
Samsung
Deal
Bravia
Now I want to filter any URL that has any of the words. Normally i would do a
Select url from url_table where url not like '%Sony%' or url not like '%Samsung%' or url not like '%Deal%' or not like '%Bravia%';
But that's a cumbersome and not scalable way to do it. What is the best way to achieve this? How do I use a not like function to the words table?
Upvotes: 1
Views: 276
Reputation: 38325
Using regex:
where url not rlike '(?i)Sony|Samsung|Deal|Bravia'
(?i)
means case insesitive.
And now let's build the same regexp from the table with words.
You can aggregate list of words from the table and pass it to the rlike. See this example:
with
initial_data as (--replace with your table
select stack(7,
'https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html',
'https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop',
'https://www.t3.com/news/cheap-oled-tv-deals-currys-august',
'https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/',
'https://www.sony.co.in/electronics/televisions/a1-series',
'https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW',
'https://www.91mobiles.com/list-of-tvs/sony-oled-tv'
) as url ) ,
words as (-- replace with your words table
select stack (4, 'Sony','Samsung','Deal','Bravia') as word
),
sub as (--aggregate list of words for rlike
select concat('''','(?i)',concat_ws('|',collect_set(word)),'''') words_regex from words
)
select s.url
from initial_data s cross join sub --cross join with words_regex
where url not rlike sub.words_regex --rlike works fine
Result:
OK
url
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
Time taken: 10.145 seconds, Fetched: 3 row(s)
Also you can calculate sub subquery separately and pass it's result as a variable instead of cross join in my example. Hope you got the idea.
Upvotes: 1