Reputation: 90
I'm currently trying to deduplicate rows within Google BigQuery for one of my tables. Basically, I have a table which bas rows containing duplicate values except for one single column. One example would be:
Please notice that all columns, except for the column "Ad group name" are the same. What I would like to do, is the following: keep a single row (wouldn't matter which one) if all columns are the same but the column ad group name is different.
I was thinking of creating partitions and using a rank function to denotate different values within that partition. Something like:
RANK() OVER (PARTITION BY Adgroup ID, date, Sales, Cost ORDER BY Ad group name) AS rank
Theoretically (hehe) this should lead to something like:
Using this I could filter in a new subquery using WHERE Rank = 1
. This would remove all the duplicate rows in that case.
However, I'm finding that BigQuery doesn't support using FLOAT64 for partitioning, so my solution doesn't work. Also, I think there's a better way to do this but I'm having trouble finding out on how to do that. Which function in BigQuery can I leverage for this specifically?
Upvotes: 0
Views: 1033
Reputation: 1311
There are multiple possible solutions for this case.
One using partition by
with analytic function
and the other using group by
with ARRAY_AGG()
.
(possible solutions of avoiding GROUP BY
with any_value
)
ROW_NUMBER
WITH
org_table AS (
SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'TVs' as AdGroupName, 800 as Sales, 200 as Cost
UNION ALL SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'Televisions' as AdGroupName, 800 as Sales, 200 as Cost
)
SELECT *
FROM org_table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY AdgroupID ORDER BY AdDate DESC) = 1
;
GROUP BY
with AGG_ARRAY
WITH
org_table AS (
SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'TVs' as AdGroupName, 800 as Sales, 200 as Cost
UNION ALL SELECT 15840 as AdGroupID, '22-1-2019' as AdDate, 'Televisions' as AdGroupName, 800 as Sales, 200 as Cost
)
SELECT
AdgroupID, -- primary key
ARRAY_AGG(
STRUCT(AdDate, AdGroupName, Sales, Cost)
ORDER BY AdGroupName ASC LIMIT 1
)[OFFSET(0)].*
FROM org_table
GROUP BY AdgroupID
;
Upvotes: 0
Reputation: 1269773
You can use group by
for this purpose:
select Adgroup ID, date, Sales, Cost, any_value(ad_group_name)
from t
group by Adgroup ID, date, Sales, Cost;
Upvotes: 3