Reputation: 91
I am lost and trying to figure out the reason I am getting different outputs for the same logic. I want to find out all calls (variants and non-variants) for each call set, and omits any call with a non-PASS filter.
The query is given at the link:
#standardSQL
SELECT
call.name AS call_name,
COUNT(1) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
NOT EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter != 'PASS')
GROUP BY
call_name
ORDER BY
call_name
Row | call_name | number_of_calls |
---|---|---|
1 | NA12877 | 29795946 |
2 | NA12878 | 26118774 |
3 | NA12889 | 29044992 |
4 | NA12890 | 28717437 |
5 | NA12891 | 31395995 |
6 | NA12892 | 25349974 |
This returns the number of rows that have filter as PASS.
But, when I try a similar logic I get a different result.
#standardSQL
SELECT
call.name AS call_name,
COUNT(1) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter = 'PASS')
GROUP BY
call_name
ORDER BY
call_name
Row | call_name | number_of_calls |
---|---|---|
1 | NA12877 | 4488086 |
2 | NA12878 | 4503443 |
3 | NA12889 | 4423974 |
4 | NA12890 | 4529950 |
5 | NA12891 | 4425316 |
6 | NA12892 | 4497085 |
Why is this? Is my logic wrong? Can someone explain why empty array is a "PASS" in this case? Thank you for taking the time to help me!
Upvotes: 2
Views: 10459
Reputation: 91
Thank you @shawnt00 for providing the solution.
I will demonstrate with a few smaller examples that the extra counts are indeed empty arrays.
Query 1
with smalltable as (
select ["PASS"] as filter, 'NA12877' as name union all
select ["PASS"], 'NA12877' union all
select ["PASS"], 'NA12879' union all
select ["PASS",'RefCall'], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879', union all
select [], 'NA12879'
)
select name, count(1) as count
from smalltable
where not exists (select 1 from unnest(filter) as f where f != 'PASS')
group by name
Output:
Row | name | count |
---|---|---|
1 | NA12877 | 2 |
2 | NA12879 | 2 |
As we can see the query considers [] as 'PASS'. I am not sure why?
Query 2
with smalltable as (
select ["PASS"] as filter, 'NA12877' as name union all
select ["PASS"], 'NA12877' union all
select ["PASS"], 'NA12879' union all
select ["PASS",'RefCall'], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879', union all
select [], 'NA12879'
)
select name, count(1) as count
from smalltable
where exists (select 1 from unnest(filter) as f where f = 'PASS')
group by name
Output:
Row | name | count |
---|---|---|
1 | NA12877 | 3 |
2 | NA12879 | 1 |
This as expected does not count the empty array.
Upvotes: 3