String matching in URL using Hive / Spark SQL

Question

I have two tables, one containing list of URL and other having a list of words. My requirement is to filter out the URLs containing the words. For eg:

URL

https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://www.t3.com/news/cheap-oled-tv-deals-currys-august
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
https://www.sony.co.in/electronics/televisions/a1-series
https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW
https://www.91mobiles.com/list-of-tvs/sony-oled-tv

Words

Sony
Samsung
Deal
Bravia

Now I want to filter any URL that has any of the words. Normally i would do a

Select url from url_table where url not like '%Sony%' or url not like '%Samsung%' or url not like '%Deal%' or not like '%Bravia%';

But that's a cumbersome and not scalable way to do it. What is the best way to achieve this? How do I use a not like function to the words table?

leftjoin · Accepted Answer

Using regex:

 where url not rlike '(?i)Sony|Samsung|Deal|Bravia'

(?i) means case insesitive.

And now let's build the same regexp from the table with words.

You can aggregate list of words from the table and pass it to the rlike. See this example:

with 

initial_data as (--replace with your table
select stack(7,
'https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html',
'https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop',
'https://www.t3.com/news/cheap-oled-tv-deals-currys-august',
'https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/',
'https://www.sony.co.in/electronics/televisions/a1-series',
'https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW',
'https://www.91mobiles.com/list-of-tvs/sony-oled-tv'
) as url ) ,


words as (-- replace with your words table
select stack (4, 'Sony','Samsung','Deal','Bravia') as word
),

sub as (--aggregate list of words for rlike
select concat('''','(?i)',concat_ws('|',collect_set(word)),'''') words_regex from words
)

select s.url
  from initial_data s cross join sub  --cross join with words_regex 
  where url not rlike sub.words_regex --rlike works fine

Result:

OK
url
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
Time taken: 10.145 seconds, Fetched: 3 row(s)

Also you can calculate sub subquery separately and pass it's result as a variable instead of cross join in my example. Hope you got the idea.

String matching in URL using Hive / Spark SQL

Answers (1)

Related Questions