Reputation: 23
When using LAST or FIRST in a query that runs over multiple tables, the results returned are not consistent. Executing the same query again will return different results every time.
The values returned do match however, the correct LAST and FIRST values of one of the tables in the query. However, the table used in each case seems to be randomly selected.
For instance, the query below:
SELECT
FIRST(timestamp) as first_ts,
FROM TABLE_DATE_RANGE([some_table]_, timestamp('2016-07-21'), timestamp('2016-07-22'))
will return:
first_ts
2016-07-22 07:35:30 UTC
the first time is run, and:
first_ts
2016-07-21 23:16:26 UTC
the next time is run. And so on.
The expected result is 2016-07-21 23:16:26
. However, the value 2016-07-22 07:35:30 UTC
is the first timestamp of the table [some_table]_20160722
, so what I believe it's happening is that BigQuery does not allow you to sort the tables when querying across tables. As a result, FIRST and LAST become unreliable since the order of the tables is not known. Note that MAX and MIN do work well across tables.
This is true not just when using the legacy SQL, but also using the new SQL and the functions FIRST_VALUE and LAST_VALUE (since there is no support for FIRST and LAST anymore. Probably related?) See example below using new SQL.
SELECT
FIRST_VALUE(timestamp) OVER (ORDER BY event_sequence ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_ts,
FROM `some_table_*`
WHERE _TABLE_SUFFIX BETWEEN '20160721' AND '20160722'
Additionally (and this could be considered a separate issue) a similar behavior is observed when querying a single table where data has been appended multiple times to it. The results of FIRST and LAST become inconsistent. Each subset of data loaded, behaves as a partition.
This seems to me like a bug, unless someone knows a way to do this. Google Bigquery team, we love your product. But could you please fix this? We can't use FIRST and LAST aggregations in a reliable way, and that's a big deal for us. Thanks!
Upvotes: 2
Views: 1408
Reputation: 172993
We can't use FIRST and LAST aggregations in a reliable way, and that's a big deal for us
I don't think there is a bug here!
From documentation:
FIRST(expr)
Returns the first sequential value in the scope of the function.
Table by itself is not considered as an ordered sequence. BigQuery does not guarantee the order of output rows unless you provide some hint - for example ORDER BY
So in your example - you should use below. This will give you expected result in reliable way!
SELECT
FIRST(timestamp) AS first_ts
FROM (
SELECT *
FROM TABLE_DATE_RANGE([some_table]_, TIMESTAMP('2016-07-21'), TIMESTAMP('2016-07-22'))
ORDER BY timestamp
)
Note: of course MAX and MIN do work as they don't care of order
Upvotes: 2