In BigQuery, how to filter on json for rows where sum of some subelements is positive?

Question

Following up on How to get count of matches in field of table for list of phrases from another table in bigquery? Where you end up with something like:

Row str                     all_matches  
1   foo1 foo foo40          [{"key":"foo","matches":2},{"key":"test","matches":0}]   
2   test1 test test2 test   [{"key":"foo","matches":0},{"key":"test","matches":2}]

How could you further filter on those rows for which sum(matches over all keys) > 0 with StandardSQL?

Mikhail Berlyant · Accepted Answer

To keep it simple - just add below line to the end of referenced query

HAVING SUM(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]')))) > 0

So, the final query (BigQuery Standard SQL) will be

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'foo1 foo foo40' str UNION ALL
  SELECT 'test1 test test2 test' UNION ALL
  SELECT 'abc xyz'
), `project.dataset.keywords` AS (
  SELECT 'foo' key UNION ALL
  SELECT 'test'
)
SELECT str, 
  TO_JSON_STRING(ARRAY_AGG(STRUCT(key, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) AS matches))) all_matches
FROM `project.dataset.table` 
CROSS JOIN `project.dataset.keywords`
GROUP BY str
HAVING SUM(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]')))) > 0

with result

Row str                     all_matches  
1   foo1 foo foo40          [{"key":"foo","matches":2},{"key":"test","matches":0}]   
2   test1 test test2 test   [{"key":"foo","matches":0},{"key":"test","matches":2}]

Note: I added one more row into dummy data and it is filtered out from output because there is no matches at all in that row

In BigQuery, how to filter on json for rows where sum of some subelements is positive?

Answers (1)

Related Questions