Catarina Ribeiro
Catarina Ribeiro

Reputation: 646

How to delete duplicates in a table in snowflake, but leaving only one row?

I need to eliminate the duplicates of a table in snowflake, but there's an issue that I can't solve.

I'm using this code:

DELETE FROM int_ga.DIM_table
WHERE id  ,  in (
  SELECT id  
  FROM (
      SELECT SK_DIM_CHANNEL
          ,ROW_NUMBER() OVER (PARTITION BY id  , name ORDER BY id  ) AS rn
      FROM int_ga.DIM_table
  )
  WHERE rn > 1
);

Imagine this example table:

id name
1 example1
1 example1
2 example2
3 example3
3 example3

It's supposed to be like this:

id name
1 example1
2 example2
3 example3

But in the end, it eliminates both of the duplicates. I can't make it only delete one of them.

Upvotes: 1

Views: 484

Answers (1)

Lukasz Szozda
Lukasz Szozda

Reputation: 176114

In scenario like this, when there is no unique id an alternative approach of reinserting all rows is possible(if table is relatively small):

INSERT OVERWRITE INTO int_ga.DIM_table(id, name) 
SELECT id, name
FROM int_ga.DIM_table 
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) = 1;

More intrusive way is to recreate entire table:

CREATE OR REPLACE int_ga.DIM_table 
COPY GRANTS
AS
SELECT *
FROM int_ga.DIM_table 
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) = 1;

The windowed function is not stable:

ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id  ) 

as the ORDER BY id is always the same(column id is already being used for partitioning.

If the goal is to leave single entry only per id then it should be rather:

ROW_NUMBER() OVER (PARTITION BY id ORDER BY <some_column>) 

Upvotes: 3

Related Questions