Paisley
Paisley

Reputation: 91

BigQuery NOT EXISTS... NOT EQUAL TO (!=) and EXISTS... EQUAL TO Giving Different Results

I am lost and trying to figure out the reason I am getting different outputs for the same logic. I want to find out all calls (variants and non-variants) for each call set, and omits any call with a non-PASS filter.

The query is given at the link:

#standardSQL
SELECT
  call.name AS call_name,
  COUNT(1) AS number_of_calls
FROM
  `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
  NOT EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter != 'PASS')
GROUP BY
  call_name
ORDER BY
  call_name
Row call_name number_of_calls
1 NA12877 29795946
2 NA12878 26118774
3 NA12889 29044992
4 NA12890 28717437
5 NA12891 31395995
6 NA12892 25349974

This returns the number of rows that have filter as PASS.

But, when I try a similar logic I get a different result.

#standardSQL
SELECT
  call.name AS call_name,
  COUNT(1) AS number_of_calls
FROM
  `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
   EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter = 'PASS')
GROUP BY
  call_name
ORDER BY
  call_name
Row call_name number_of_calls
1 NA12877 4488086
2 NA12878 4503443
3 NA12889 4423974
4 NA12890 4529950
5 NA12891 4425316
6 NA12892 4497085

Why is this? Is my logic wrong? Can someone explain why empty array is a "PASS" in this case? Thank you for taking the time to help me!

Upvotes: 2

Views: 10459

Answers (1)

Paisley
Paisley

Reputation: 91

Thank you @shawnt00 for providing the solution. I will demonstrate with a few smaller examples that the extra counts are indeed empty arrays.
Query 1

with smalltable as (
    select ["PASS"] as filter, 'NA12877' as name union all 
    select ["PASS"], 'NA12877' union all 
    select ["PASS"], 'NA12879' union all 
    select ["PASS",'RefCall'], 'NA12877' union all 
    select ["RefCall"], 'NA12877' union all 
    select ["RefCall"], 'NA12877' union all 
    select ["RefCall"], 'NA12879' union all 
    select ["RefCall"], 'NA12879' union all 
    select ["RefCall"], 'NA12879' union all 
    select ["RefCall"], 'NA12879', union all
    select [], 'NA12879'   
)
select name, count(1) as count 
from smalltable 
where not exists (select 1 from unnest(filter) as f where f != 'PASS')
group by name

Output:

Row name count
1 NA12877 2
2 NA12879 2

As we can see the query considers [] as 'PASS'. I am not sure why?

Query 2

with smalltable as (
    select ["PASS"] as filter, 'NA12877' as name union all 
    select ["PASS"], 'NA12877' union all 
    select ["PASS"], 'NA12879' union all 
    select ["PASS",'RefCall'], 'NA12877' union all 
    select ["RefCall"], 'NA12877' union all 
    select ["RefCall"], 'NA12877' union all 
    select ["RefCall"], 'NA12879' union all 
    select ["RefCall"], 'NA12879' union all 
    select ["RefCall"], 'NA12879' union all 
    select ["RefCall"], 'NA12879', union all
    select [], 'NA12879'   
)
select name, count(1) as count 
from smalltable 
where exists (select 1 from unnest(filter) as f where f = 'PASS')
group by name

Output:

Row name count
1 NA12877 3
2 NA12879 1

This as expected does not count the empty array.

Upvotes: 3

Related Questions