CuriousMind
CuriousMind

Reputation: 34145

Bigquery select non duplicate records

Consider the following table (simplified version):

id int,
amount decimal,
transaction_no,
location_id int,
created_at datetime

The above schema is used to store POS receipts for restaurants. Now, this table sometimes contains a receipt from same date, same transaction_no at same location_id.

In which case what I want to do is get the last receipt of that location_id & transaction_no order by created_at desc.

In MySQL, I use the following query which gets me last (max(created_at) receipt for a location_id & transaction_no:

SELECT id, amount, transaction_no, location_id, created_at
 FROM receipts r JOIN
 (SELECT transaction_no, max(created_at) AS maxca
  FROM receipts r
  GROUP BY transaction_no
 ) t
 ON r.transaction_no = t.transaction_no AND r.created_at = t.maxca
group by location_id;

But when I run the same in BigQuery, I get the following error:

Query Failed Error: Shuffle reached broadcast limit for table __I0 (broadcasted at least 150393576 bytes). Consider using partitioned joins instead of broadcast joins . Job ID: circular-gist-812:job_A_CfsSKJICuRs07j7LHVbkqcpSg

Any idea how to make the above query work in BigQuery?

Upvotes: 2

Views: 826

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

SELECT id, amount, transaction_no, location_id, created_at 
FROM (
  SELECT 
  id, amount, transaction_no, location_id, created_at, 
  ROW_NUMBER() OVER(PARTITION BY transaction_no, location_id 
                    ORDER BY created_at DESC) as last
  FROM your_dataset.your_table
)
WHERE last = 1

Upvotes: 4

Related Questions