Robbert Smit
Robbert Smit

Reputation: 90

Deduplicating rows in BigQuery based on condition

I'm currently trying to deduplicate rows within Google BigQuery for one of my tables. Basically, I have a table which bas rows containing duplicate values except for one single column. One example would be:

enter image description here

Please notice that all columns, except for the column "Ad group name" are the same. What I would like to do, is the following: keep a single row (wouldn't matter which one) if all columns are the same but the column ad group name is different.

I was thinking of creating partitions and using a rank function to denotate different values within that partition. Something like:

RANK() OVER (PARTITION BY Adgroup ID, date, Sales, Cost ORDER BY Ad group name) AS rank

Theoretically (hehe) this should lead to something like:

enter image description here

Using this I could filter in a new subquery using WHERE Rank = 1. This would remove all the duplicate rows in that case.

However, I'm finding that BigQuery doesn't support using FLOAT64 for partitioning, so my solution doesn't work. Also, I think there's a better way to do this but I'm having trouble finding out on how to do that. Which function in BigQuery can I leverage for this specifically?

Upvotes: 0

Views: 1033

Answers (2)

Jiho Choi
Jiho Choi

Reputation: 1311

There are multiple possible solutions for this case. One using partition by with analytic function and the other using group by with ARRAY_AGG().

(possible solutions of avoiding GROUP BY with any_value)

  1. ANALYTIC FUNCTION ROW_NUMBER
WITH
org_table AS (
    SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'TVs' as AdGroupName, 800 as Sales, 200 as Cost
    UNION ALL SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'Televisions' as AdGroupName, 800 as Sales, 200 as Cost
)
SELECT *
FROM org_table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY AdgroupID ORDER BY AdDate DESC) = 1
;
  1. GROUP BY with AGG_ARRAY
WITH
org_table AS (
    SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'TVs' as AdGroupName, 800 as Sales, 200 as Cost
    UNION ALL SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'Televisions' as AdGroupName, 800 as Sales, 200 as Cost
)
SELECT
    AdgroupID,  -- primary key
    ARRAY_AGG(
        STRUCT(AdDate, AdGroupName, Sales, Cost)
        ORDER BY AdGroupName ASC LIMIT 1
    )[OFFSET(0)].*
FROM org_table
GROUP BY AdgroupID
;

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269773

You can use group by for this purpose:

select Adgroup ID, date, Sales, Cost, any_value(ad_group_name)
from t
group by Adgroup ID, date, Sales, Cost;

Upvotes: 3

Related Questions