Reputation: 646
I need to eliminate the duplicates of a table in snowflake, but there's an issue that I can't solve.
I'm using this code:
DELETE FROM int_ga.DIM_table
WHERE id , in (
SELECT id
FROM (
SELECT SK_DIM_CHANNEL
,ROW_NUMBER() OVER (PARTITION BY id , name ORDER BY id ) AS rn
FROM int_ga.DIM_table
)
WHERE rn > 1
);
Imagine this example table:
id | name |
---|---|
1 | example1 |
1 | example1 |
2 | example2 |
3 | example3 |
3 | example3 |
It's supposed to be like this:
id | name |
---|---|
1 | example1 |
2 | example2 |
3 | example3 |
But in the end, it eliminates both of the duplicates. I can't make it only delete one of them.
Upvotes: 1
Views: 484
Reputation: 176114
In scenario like this, when there is no unique id an alternative approach of reinserting all rows is possible(if table is relatively small):
INSERT OVERWRITE INTO int_ga.DIM_table(id, name)
SELECT id, name
FROM int_ga.DIM_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) = 1;
More intrusive way is to recreate entire table:
CREATE OR REPLACE int_ga.DIM_table
COPY GRANTS
AS
SELECT *
FROM int_ga.DIM_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) = 1;
The windowed function is not stable:
ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id )
as the ORDER BY id
is always the same(column id is already being used for partitioning.
If the goal is to leave single entry only per id
then it should be rather:
ROW_NUMBER() OVER (PARTITION BY id ORDER BY <some_column>)
Upvotes: 3