Reputation: 117
I have one table that contains id as single column and multiple columns that are nested.
1)
An example schema for better understanding :
id - String,
childrenNames - String repeated,
animalNames - String repeated,
Another table contains only single columns
2)
An example schema for better understanding :
childrenName - String,
animalName - String
I need to know all the records from table 2) that are not in table 1) So both childrenName and animalName need to belong to one user, together.
I can add that I tried to select separately for each column from table 2) values that are 'IN' lists from table 1) but if it returns any row it can also mean both are belonging to two different ids (or more).
Example row table 1)
id: 1234,
childrenNames : ['Ana', 'Frank'],
animalNames : ['Rex', 'Max'],
Example rows table 2)
A)
childrenName : 'Ana',
animalName : 'Ozzy'
B)
childrenName : 'Frank',
animalName : 'Rex'
For above examples I should get the row A) from table 2) because 'Ozzy' does not belong to the id 1234 (lets suppose we do not have more records in table 1) )
Anyone idea how to solve such problem with BigQuery SQL (Standard or Legacy) ?
Upvotes: 0
Views: 3422
Reputation: 172993
Below is for BigQuery Standard SQL
#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName
You can test, play with above using data example for your question
#standardSQL
WITH `project.dataset.table1` AS (
SELECT '1' id, ['Ana', 'Frank'] childrenNames, ['Rex', 'Max'] animalNames
), `project.dataset.table2` AS (
SELECT 'Ana' childrenName, 'Ozzy' animalName UNION ALL
SELECT 'Frank', 'Rex'
)
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName
with result
Row childrenName animalName users
1 Frank Rex 1
Note: field users
in the output is a repeated string / array that consists list of users who have the search pair
The less verbose variation of above will be
#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE childrenName IN UNNEST(childrenNames)
AND animalName IN UNNEST(animalNames)
GROUP BY childrenName, animalName
with exact same result
So, obviously - use this second one :o)
... Table 1) has 5mln records, Table 2) 200k - thus
Query exceeded resource limits
Try below version
#standardSQL
WITH flatten_table1 AS (
SELECT id, childrenName, animalName
FROM `project.dataset.table1`,
UNNEST(childrenNames) childrenName,
UNNEST(animalNames) animalName
)
SELECT childrenName, animalName, id
FROM `project.dataset.table2`
JOIN flatten_table1
USING(childrenName, animalName)
Upvotes: 1