Reputation: 2300
I have a BigQuery table with ~5k unique IDs. Every day new rows are inserted for IDs that may or may not already exist.
We use this query to find the most recent rows:
SELECT t.*
EXCEPT (seqnum),
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id
ORDER BY date_of_data DESC
) as seqnum
FROM `[project]`.[dataset].[table] t
) t
WHERE seqnum = 1
Although we only want the most recent row for each ID, this query must scan the entire table. This query is slower and more expensive every day as the table size grows. Right now, for an 8GB table, the query above creates a 22MB table. We would much rather query the 22MB table if it could stay up-to-date.
Is it possible to create a materialized view that gets the latest rows for each ID?
Is there a better solution than growing tables to infinity?
Other requirements:
Upvotes: 4
Views: 991
Reputation: 10152
One of the solutions would be to partition your main table (with all rows) by column date_of_data
with a daily granularity.
Create a separate table which will keep only the most recent row for each ID. Populate it once with a single scan of entire main table and then update it every day by querying only the last day of the main table. Thanks to the partitioning querying the last day of the main table will scan only the last day of the main table.
Upvotes: 1