franco pina
franco pina

Reputation: 333

Delete repeated data in Bigquery

I am optimizing a query in Bigquery that shows non-repeated data, currently it is like this and it works.

select * from (select ROW_NUMBER() OVER (PARTITION BY id) as num, id, created_at, operator_id, description from NAME_TABLE where created_at >='2018-01-01') where num=1

I wanted to ask if it is possible to make a GROUP BY with all the columns (in a simple way it cannot be done, since crated_at is not possible to group it) and keep the first data of created_at that appears for each id

PD:a DISTINCT does not work, since there are more than 80 million records (they increase 2 million per day) and it returns repeated data

Upvotes: 0

Views: 67

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1271241

You query should be fine. But you can do this without a subquery:

select array_agg(nt order by created_at desc limit 1)[ordinal(1)].*
from name_table nt
where created_at >='2018-01-01'
group by id

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

Below is for BigQuery Standard SQL

#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id

Instead of processing / returning all columns - you can specify exact list you need as in below example

#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id   

Upvotes: 1

Related Questions