Reputation: 85
I'm trying to select records from one table based on whether an item in a repeated field is located in a column in another table. I've been able to do this when explicitly listing the items I am testing for in my code, but not when selecting from another table. Let me demonstrate using the trigrams dataset:
Let's say that I wanted to select all records that had an appearance during certain years. But I don't just want just the data from those years - I want all the data associated with those records. If I just wanted the data for a few years I could do something like this (and this works):
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
SOME(cell.value in ('1800', '1801')) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid
However, instead of coding '1800' and '1801' into my query, I have a table years
that contains a set of years I am interested in. I would like this to work:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
SOME(cell.value in (SELECT year_as_str FROM [mydataset.years])) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid
This doesn't work because bigquery requires that semi-joins be part of a WHERE
or HAVING
clause.
So I try to rearrange (back to the first query):
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802')) WITHIN RECORD
This results in an error Encountered " "WITHIN" "WITHIN "" ... Was expecting <EOF>
So now without the WITHIN RECORD
:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802'))
This results in an error SELECT clause has mix of aggregations '...' and fields '...' without GROUP BY clause
But I'm not aggregating! So now I move the filter into a WHERE
:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
WHERE SOME(cell.value in ('1801', '1802'))
And this tells me that Invalid function name: SOME
. What?!
Is there a way to get the behavior I am looking for with BigQuery?
Upvotes: 1
Views: 574
Reputation: 3251
You can use the OMIT RECORD IF
clause for this. This may require double negation, since you need to omit records that meet some condition. The following query should work:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
OMIT RECORD IF EVERY(cell.value NOT IN ('1801', '1802'))
Upvotes: 0
Reputation: 173046
Below solves your example and I hope you will be able to extend it to your real use case (if you will like solution)
SELECT
ngram, cell.value, cell.volume_count,
cell.volume_fraction, cell.page_count, cell.match_count
FROM [publicdata:samples.trigrams] AS trigrams
JOIN (
SELECT ngram AS qualified
FROM (
FLATTEN((SELECT ngram, cell.value AS value
FROM (FLATTEN([publicdata:samples.trigrams], cell.value))), value)
) AS t
JOIN [mydataset.years] AS y
ON y.year_as_str = t.value
GROUP BY 1
) AS valid
ON valid.qualified = trigrams.ngram
Please pay attention to the fact that in [publicdata:samples.trigrams]
the field cell.value
is REPEATED STRING
- that is why you see "extra" FLATTEN things
Upvotes: 1