64th
64th

Reputation: 3

How to see if a string in a table contains strings specified in a different table

I've been breaking my brains over the following problem. I have two BigQuery tables, one table with approximately 3 million search terms:

search_term
number of people named joe
how to paint a table black
top 100 pop songs
lovely horses
..

and a list of 8000 "keywords". Each row in the keyword table is a number of words.

keyword
name joe
horses 
baby kitten
song top 100
..

For each string in the 'query' table, I want to check if it contains any of the words that are in the 'keywords' table. However, the order in which the keywords appear in the 'query' string does not matter, and there can be other words inbetween. This is the resulting table I'm looking for:

search term                    contains_keywords
number of people named joe     TRUE
how to paint a table black     FALSE
top 100 pop songs              TRUE
lovely horses                  TRUE
..  

I've come up with the following code to check if each search term (as a whole) completely matches a keyword, but I have no clue how to go about splitting the keywords and then checking if each query contains these words. This is the code I have so far, but I'd greatly appreciate any help or pointers in the right direction.

SELECT 
  *
, CASE 
    WHEN search_term IN (
      SELECT
        keyword
      FROM 
        keywords)
    THEN true
    ELSE false
  END AS contains_keyword
FROM search_terms

EDIT: I've provided the two above sample tables to make it easier to try out your own code.


WITH 

search_terms AS (
  SELECT 'number of people named joe' AS search_term UNION ALL
  SELECT 'how to paint a table black' AS search_term UNION ALL
  SELECT 'top 100 pop songs'          AS search_term UNION ALL
  SELECT 'lovely horses'              AS search_term

  )

,keywords AS (
  SELECT 'name joe'         AS keyword UNION ALL
  SELECT 'horses lovely'    AS keyword UNION ALL  
  SELECT 'baby kitten'      AS keyword UNION ALL
  SELECT 'song top 100'     AS keyword
  )  
SELECT * FROM search_terms

Upvotes: 0

Views: 2002

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL

Using regex pattern consisting 8000+ words - can be quite a resource hog!
Below is workaround

#standardSQL
SELECT search_term, 
  ( SELECT COUNT(1)
    FROM UNNEST(SPLIT(search_term, ' ')) word
    JOIN UNNEST(keywords) word
    USING(word)
  ) > 0 AS contains_keyword
FROM `project.dataset.search_terms`,
UNNEST([STRUCT(ARRAY(
  SELECT DISTINCT keyword
  FROM `project.dataset.keywords`, UNNEST(SPLIT(keyword, ' ')) keyword
) AS keywords)])

If to apply to sample data from your question - output is

enter image description here

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269773

You can use regular expressions:

SELECT st.*,
       regexp_contains(search_term, k.pattern)
FROM search_terms st cross join
     (select string_agg(replace(keyword, ' ', '|'), '|') as pattern
      from keywords
     ) k;

This converts all the keywords into a single regular expression with the strings separated by |, which is regexp-speak for "or".

Upvotes: 0

Related Questions