Cyrus
Cyrus

Reputation: 1022

How to remove duplicate rows in CockroachDB

I have a table in CockroachDB, I have populated the data into the table before applying the constraints to set the primary key, and because some of INSERT statement failed through the data-loading phase, some of the rows are loaded into the table more than one time by mistake.

The constraint I want to apply is:

CREATE UNIQUE INDEX ON "MyDB"."MyTable" ("Row_ID");

But as the duplicate data is already loaded into the table, I get the following error:

pq: multiple primary keys for table "MyTable" are not allowed

I have check to see if actually there are some duplicated rows with the following query:

SELECT  
    Row_ID,
    COUNT(Row_ID) as id
FROM    
    MyTable
GROUP BY 
    Row_ID
HAVING  
    COUNT(Row_ID) > 1;

And the query showed there are some duplicate rows.

What is the best way to remove the duplicate rows in CockroachDB?

Upvotes: 0

Views: 616

Answers (3)

alvitawa
alvitawa

Reputation: 393

If you dont care about the order:

DELETE FROM MyTable WHERE Row_ID NOT IN (
SELECT MIN(Row_ID) FROM MyTable GROUP BY Row_ID);

Upvotes: 0

Juan
Juan

Reputation: 5589

If they are an exact match you can create a new table with the distinct records.

Then delete all the old table and repopulate it from the table created in last step.

To create the table:

create table MyWorkData as select distinct * from MyTable;

To delete MyTable

delete from MyTable;

To repopulate MyTable. (Thankyou @benesch for correcting the statement)

INSERT INTO MyTable SELECT * FROM MyWorkData

Finally delte the working table.

Upvotes: 2

benesch
benesch

Reputation: 5269

If you don't care about which duplicated row you keep, run:

DELETE FROM MyTable WHERE rowid IN (
    SELECT MIN(rowid) FROM MyTable GROUP BY Row_ID HAVING COUNT(*) > 1
);

For any duplicates, that query will delete all but the row that was roughly created first.†

Note that rowid is not the same as your Row_ID column. rowid is an internal CockroachDB column that is magically created for tables with no primary key, and is guaranteed to be unique for each row in the table.

rowid stores <timestamp><node-id>, so it sorts roughly according to insertion time, but rows inserted near-simultaneously by nodes with skewed clocks may not sort in insertion order.

Upvotes: 3

Related Questions