Why is BigQuery COUNT(*) of GROUP BY incorrect?

Question

I have a table with a unique integer ID and an AuditUpdated timestamp. I'm trying to divide it into groups of records by ID range like so:

SELECT CAST(ID / 100000 AS INT64) AS `Start`,
       COUNT(*) AS `RowCount`,
       MAX(AuditUpdated) AS `LastMod`
FROM `dataset.mytable`
GROUP BY `Start`
ORDER BY `Start`

This appears to work and I get the following for the first 3 records:

0, 25526, 2016-03-29 15:32:11.473 UTC
1, 65892, 2017-11-22 20:05:38.800 UTC
2, 48550, 2017-12-23 01:03:25.030 UTC

However, the COUNT(*) here is wrong. For example, I can run the following:

SELECT COUNT(*) FROM `dataset.mytable` WHERE ID < 100000

And the result is 61989! Not only that, the MAX(AuditUpdated) is actually 2017-03-30 22:47:19.153 UTC.

I have the same query running against the same table in a Microsoft SQL Server database and it works correctly. I can only assume BigQuery does things differently but I can't for the life of me figure out how!

rtenha · Accepted Answer

Run this toy example in BigQuery:

select 
  cast(x/10 as int64) as bucket,
  min(x) as min_x,
  max(x) as max_x
from unnest(generate_array(1,1000,1)) x
group by 1

In SQL Server, an INT/INT performs "integer division", which returns the whole integer part of the result and discards the remainder/decimal places (aka rounding down). Running select 5/3 should return 1 and not 1.666667~ to illustrate this.

BigQuery, on the other hand, returns a FLOAT64 when dividing INT64/INT64

As noted in BigQuery StandardSQL Documentation, casting a FLOAT64 to an INT64 "returns the closest INT64 value" (aka normal rounding). Your 1 bucket is actually counting all IDs between 50000 and 149999.

If you want to replicate the SQL Server logic, consider making the following edit to your BigQuery code: CAST(FLOOR(ID / 100000) AS INT64).

Why is BigQuery COUNT(*) of GROUP BY incorrect?

Answers (1)

Related Questions