Reputation: 333
I am optimizing a query in Bigquery that shows non-repeated data, currently it is like this and it works.
select * from (select ROW_NUMBER() OVER (PARTITION BY id) as num, id, created_at, operator_id, description from NAME_TABLE where created_at >='2018-01-01') where num=1
I wanted to ask if it is possible to make a GROUP BY with all the columns (in a simple way it cannot be done, since crated_at is not possible to group it) and keep the first data of created_at that appears for each id
PD:a DISTINCT does not work, since there are more than 80 million records (they increase 2 million per day) and it returns repeated data
Upvotes: 0
Views: 67
Reputation: 1271241
You query should be fine. But you can do this without a subquery:
select array_agg(nt order by created_at desc limit 1)[ordinal(1)].*
from name_table nt
where created_at >='2018-01-01'
group by id
Upvotes: 0
Reputation: 173210
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id
Instead of processing / returning all columns - you can specify exact list you need as in below example
#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id
Upvotes: 1