BigQuery joining tables on repeated fields

Question

I have one table that contains id as single column and multiple columns that are nested.

1)

An example schema for better understanding :

id - String,

childrenNames - String repeated,

animalNames - String repeated,

Another table contains only single columns

2)

An example schema for better understanding :

childrenName - String,

animalName - String

I need to know all the records from table 2) that are not in table 1) So both childrenName and animalName need to belong to one user, together.

I can add that I tried to select separately for each column from table 2) values that are 'IN' lists from table 1) but if it returns any row it can also mean both are belonging to two different ids (or more).

Example row table 1)

id: 1234,

childrenNames : ['Ana', 'Frank'],

animalNames : ['Rex', 'Max'],

Example rows table 2)

A)

childrenName : 'Ana',

animalName : 'Ozzy'

B)

childrenName : 'Frank',

animalName : 'Rex'

For above examples I should get the row A) from table 2) because 'Ozzy' does not belong to the id 1234 (lets suppose we do not have more records in table 1) )

Anyone idea how to solve such problem with BigQuery SQL (Standard or Legacy) ?

Mikhail Berlyant · Accepted Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName

You can test, play with above using data example for your question

#standardSQL
WITH `project.dataset.table1` AS (
  SELECT '1' id, ['Ana', 'Frank'] childrenNames,  ['Rex', 'Max'] animalNames 
), `project.dataset.table2` AS (
  SELECT 'Ana' childrenName, 'Ozzy' animalName UNION ALL
  SELECT 'Frank', 'Rex'
)
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName

with result

Row childrenName    animalName  users    
1   Frank           Rex         1

Note: field users in the output is a repeated string / array that consists list of users who have the search pair

The less verbose variation of above will be

#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE childrenName IN UNNEST(childrenNames)
AND animalName IN UNNEST(animalNames)
GROUP BY childrenName, animalName

with exact same result

So, obviously - use this second one :o)

... Table 1) has 5mln records, Table 2) 200k - thus Query exceeded resource limits

Try below version

#standardSQL
WITH flatten_table1 AS (
  SELECT id, childrenName, animalName
  FROM `project.dataset.table1`, 
  UNNEST(childrenNames) childrenName,
  UNNEST(animalNames) animalName
)
SELECT childrenName, animalName, id
FROM `project.dataset.table2`
JOIN flatten_table1
USING(childrenName, animalName)

BigQuery joining tables on repeated fields

Answers (1)

Related Questions