Reputation: 3
I've been breaking my brains over the following problem. I have two BigQuery tables, one table with approximately 3 million search terms:
search_term
number of people named joe
how to paint a table black
top 100 pop songs
lovely horses
..
and a list of 8000 "keywords". Each row in the keyword table is a number of words.
keyword
name joe
horses
baby kitten
song top 100
..
For each string in the 'query' table, I want to check if it contains any of the words that are in the 'keywords' table. However, the order in which the keywords appear in the 'query' string does not matter, and there can be other words inbetween. This is the resulting table I'm looking for:
search term contains_keywords
number of people named joe TRUE
how to paint a table black FALSE
top 100 pop songs TRUE
lovely horses TRUE
..
I've come up with the following code to check if each search term (as a whole) completely matches a keyword, but I have no clue how to go about splitting the keywords and then checking if each query contains these words. This is the code I have so far, but I'd greatly appreciate any help or pointers in the right direction.
SELECT
*
, CASE
WHEN search_term IN (
SELECT
keyword
FROM
keywords)
THEN true
ELSE false
END AS contains_keyword
FROM search_terms
EDIT: I've provided the two above sample tables to make it easier to try out your own code.
WITH
search_terms AS (
SELECT 'number of people named joe' AS search_term UNION ALL
SELECT 'how to paint a table black' AS search_term UNION ALL
SELECT 'top 100 pop songs' AS search_term UNION ALL
SELECT 'lovely horses' AS search_term
)
,keywords AS (
SELECT 'name joe' AS keyword UNION ALL
SELECT 'horses lovely' AS keyword UNION ALL
SELECT 'baby kitten' AS keyword UNION ALL
SELECT 'song top 100' AS keyword
)
SELECT * FROM search_terms
Upvotes: 0
Views: 2002
Reputation: 172993
Below is for BigQuery Standard SQL
Using regex pattern consisting 8000+ words - can be quite a resource hog!
Below is workaround
#standardSQL
SELECT search_term,
( SELECT COUNT(1)
FROM UNNEST(SPLIT(search_term, ' ')) word
JOIN UNNEST(keywords) word
USING(word)
) > 0 AS contains_keyword
FROM `project.dataset.search_terms`,
UNNEST([STRUCT(ARRAY(
SELECT DISTINCT keyword
FROM `project.dataset.keywords`, UNNEST(SPLIT(keyword, ' ')) keyword
) AS keywords)])
If to apply to sample data from your question - output is
Upvotes: 1
Reputation: 1269773
You can use regular expressions:
SELECT st.*,
regexp_contains(search_term, k.pattern)
FROM search_terms st cross join
(select string_agg(replace(keyword, ' ', '|'), '|') as pattern
from keywords
) k;
This converts all the keywords into a single regular expression with the strings separated by |
, which is regexp-speak for "or".
Upvotes: 0