Reputation: 201
How can I delete the duplicate records from a Snowflake table?
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Upvotes: 19
Views: 52017
Reputation: 644
This answer is inspired by Felipe Hoffa's, but fixes a limitation in case there are NULL
values in any of the columns relevant for duplication.
Consider a duplicate row with values col1 = 'a', col2 IS NULL:
If we use WHERE a.col1 = b.col1 AND a.col2 = b.col2 to identify the rows to delete, this row will not be deleted, because NULL = NULL always returns NULL under Snowflake's NULL handling. This can be dangerous: some duplicates will be deleted while others stay in place.
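To see the difference, here is a quick check of the two comparison semantics (a minimal illustration, not part of the original answer):
SELECT NULL = NULL;                    -- returns NULL, so the row never matches
SELECT NULL IS NOT DISTINCT FROM NULL; -- returns TRUE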
The solution is to replace the equality checks with IS NOT DISTINCT FROM:
CREATE TABLE dup_table AS
WITH dups AS (
    -- choose all columns from the original table to be re-inserted in the second step
    SELECT m.*,
           -- choose the columns to be considered when finding duplicates
           ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1, col2) AS row_num
    FROM MY_TABLE m
)
-- select a single instance of each duplicated row,
-- using all columns from the original MY_TABLE except row_num
SELECT col1, col2
FROM dups
WHERE row_num = 2;
-- delete all instances of duplicates from the original table, handling NULLs
DELETE FROM MY_TABLE m
USING DUP_TABLE d
WHERE (m.col1, m.col2) IS NOT DISTINCT FROM (d.col1, d.col2);
-- insert a single instance of duplicated rows into original table
INSERT INTO MY_TABLE
SELECT col1, col2
FROM dup_table;
Upvotes: 0
Reputation: 5273
If you are using Snowpipe (or COPY INTO) to load the data, you can use METADATA$FILE_ROW_NUMBER to differentiate the records.
Duplicates can happen often when using AWS Firehose to load data to S3 (or even directly into Snowflake), because PUT record operations must sometimes be retried (which is the recommended way to handle errors; see documentation here and here):
If there is an internal server error or a timeout, the write might have completed or it might have failed. If FailedPutCount is greater than 0, retry the request, resending only those records that might have failed processing. This minimizes the possible duplicate records and also reduces the total bytes sent (and corresponding charges). We recommend that you handle any duplicates at the destination.
So you might end up with a DELETE query like this:
DELETE FROM my_table AS t
USING (
SELECT id, file_name, file_row_number
FROM my_table
QUALIFY ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY file_name, file_row_number
) > 1
) AS dupes
WHERE t.id = dupes.id
AND t.file_name = dupes.file_name
AND t.file_row_number = dupes.file_row_number;
Where:
- id is the primary/unique key (which could be multiple fields); it is the field(s) against which duplicates are found.
- file_name is the METADATA$FILENAME captured at load time (see the sketch after this list).
- file_row_number is the METADATA$FILE_ROW_NUMBER captured at load time.
- You can also capture METADATA$START_SCAN_TIME AS load_time and add something like WHERE load_time > CURRENT_DATE() - 1 to both the SELECT and the DELETE, to limit the work to recently loaded rows.
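For reference, a minimal sketch of how such metadata columns might be captured at load time (the table, stage, and column names here are hypothetical):
COPY INTO my_table (id, payload, file_name, file_row_number)
FROM (
    SELECT $1:id,
           $1:payload,
           METADATA$FILENAME,         -- source file name
           METADATA$FILE_ROW_NUMBER   -- row number within that file
    FROM @my_stage
)
FILE_FORMAT = (TYPE = 'JSON');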
And once your source data is free of duplicates, there are no more problems dealing with duplicates downstream. At that point, duplicates could only come from a SQL programming error (usually a bad join).
Upvotes: 1
Reputation: 1060
This is inspired by @Felipe Hoffa's answer:
-- create a table with the dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
)
-- join back to the original table on the field, excluding the ID in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id
Upvotes: -1
Reputation: 1367
The following solution is effective if you are deduplicating on one or a few columns that serve as the table's primary key.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
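For concreteness, here is the same pattern instantiated against the fruit table from the question (a sketch; the deduplication key is (ID, Name)):
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
    SELECT id, name
    FROM (
        SELECT *, ROW_NUMBER () OVER(
            PARTITION BY id, name
            ORDER BY id, name) AS duplicate_count
        FROM fruit
    ) WHERE duplicate_count = 2
);
DELETE FROM fruit t1
USING temp_table t2
WHERE t1.id = t2.id AND t1.name = t2.name;
INSERT INTO fruit SELECT * FROM temp_table;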
Upvotes: 2
Reputation: 2296
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table is truncated before the insert takes place.
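Applied to the example data from the question (a quick sketch to verify the behavior):
create or replace table fruit (id number, name text);
insert into fruit values (1, 'Apple'), (1, 'Apple'),
    (2, 'Apple'), (3, 'Orange'), (3, 'Orange');
insert overwrite into fruit
select distinct * from fruit;
select * from fruit;  -- 1 Apple / 2 Apple / 3 Orange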
Upvotes: 29
Reputation: 665
Not sure if people are still interested in this, but I've used the query below, which is more elegant and has worked for me:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1
Upvotes: -1
Reputation: 59175
Adding here a solution that doesn't recreate the table, because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
- The table is not recreated, so existing configurations, grants, and history are preserved.
- Only the duplicate rows are touched; unique rows are never rewritten.
- The delete and re-insert happen within a single transaction.
Upvotes: 31
Reputation: 31
Based on the ideas above, the following query worked perfectly in my case:
CREATE OR REPLACE TABLE SCHEMA.table AS
SELECT DISTINCT *
FROM SCHEMA.table;
Upvotes: 3
Reputation: 71
This has been bothering me for some time as well. Since Snowflake added support for QUALIFY, you can now create a deduplicated table with a single statement, without subselects:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course you are left with a new table and lose table history, primary keys, foreign keys, and such.
Upvotes: 7
Reputation: 1366
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit WHERE ID = 1 AND Name = 'Apple'; then both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal row ids, but there isn't any in Snowflake, see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes, either, so your only option is to create a new table and swap.
Additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: these are a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
Upvotes: 2
Reputation: 25928
If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then you can delete like this:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key, you cannot delete that way. At that point, you can build a deduplicated copy:
CREATE TABLE new_table_name AS
SELECT id, name FROM (
    SELECT id
        ,name
        ,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id, name) AS rn
    FROM table_name
)
WHERE rn = 1
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name
Upvotes: 12
Reputation: 2850
Snowflake does not enforce primary keys; their use is primarily with ERD tools. Snowflake does not have something like a ROWID either, so there is no direct way to identify duplicates for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the copies with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1, and finally drop the utility column.
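A minimal sketch of that utility-column approach, assuming the duplicate criteria are id and name (all object names here are illustrative):
ALTER TABLE some_table ADD COLUMN is_duplicate INTEGER;
-- rewrite the table, numbering the copies within each duplicate group
INSERT OVERWRITE INTO some_table
SELECT id, name,
       ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS is_duplicate
FROM some_table;
-- keep the first copy of each group, then drop the utility column
DELETE FROM some_table WHERE is_duplicate > 1;
ALTER TABLE some_table DROP COLUMN is_duplicate;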
Another way is to create a duplicate table and swap, as others have suggested. However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update timestamp columns in a Snowflake data warehouse.
Upvotes: 7