Reputation: 478

Delete duplicate rows in MySQL where all columns match. Keep latest row

This is a topic, which has been discussed several times in here, but for me it's difficult to understand how to delete duplicated rows in our MySQL database. Yes, I have seen many examples, but they are simplified to 2 columns etc. and the solution is most of the time just query, which has not been explained. As I want to securely clean our database. I have backed up my table, and now I wan't to delete duplicated rows (as I have many of them). Below is a snippet of query, which I have tried, and it is fairly simple: if duplicated rows, pick latest row and show ID of the row.

SELECT MAX(id) FROM Raw_Validated_backup GROUP BY col1, col2, col3, ... col_n-1 having COUNT(*) > 1;

Now, as I have selected latest row of duplicated rows, I 'just' have to delete the duplicated rows. Easier said that done, if you ask me. This is a query written by @Jose Rui Santos:

delete test
  from test
 inner join (
  select max(id) as lastId, email
    from test
   where email in (
              select email 
                from test
               group by email
              having count(*) > 1
       )
   group by email
) duplic on duplic.email = test.email
 where test.id < duplic.lastId;

He replied to this post: MySQL delete duplicate records but keep latest

Question is, how do I exactly delete duplicated rows and keep the latest? Prefer to be in a pseudo like format, so others can learn from this as well. For example:

DELETE table FROM database

And not specific like:

DELETE email FROM city

MySQL Queries

EXAMPLE 1:

DELETE t1 FROM Raw_Validated_backup AS t1 INNER JOIN Raw_Validated_backup AS t2 
      ON t1.time_start=t2.time_start 
      AND t1.time_end=t2.time_end 
      AND t1.first_temp_lpn=t2.first_temp_lpn 
      AND t1.first_WL=t2.first_WL 
      AND t1.first_temp_lpn_validated=t2.first_temp_lpn_validated 
      AND t1.second_temp_lpn=t2.second_temp_lpn 
      AND t1.second_WL=t2.second_WL 
      AND t1.second_temp_lpn_validated=t2.second_temp_lpn_validated 
      AND t1.third_temp_lpn=t2.third_temp_lpn 
      AND t1.third_WL=t2.third_WL 
      AND t1.third_temp_lpn_validated=t2.third_temp_lpn_validated 
      AND t1.first_temp_rising=t2.first_temp_rising 
      AND t1.first_WR=t2.first_WR 
      AND t1.first_temp_rising_validated=t2.first_temp_rising_validated 
      AND t1.second_temp_rising=t2.second_temp_rising 
      AND t1.second_WR=t2.second_WR 
      AND t1.second_temp_rising_validated=t2.second_temp_rising_validated 
      AND t1.third_temp_rising=t2.third_temp_rising 
      AND t1.third_WR=t2.third_WR 
      AND t1.third_temp_rising_validated=t2.third_temp_rising_validated 
      AND t1.id<t2.id;

EXAMPLE 2:

DELETE FROM Raw_Validated_backup WHERE id NOT IN ( 
SELECT max(id) FROM Raw_Validated_backup 
GROUP BY time_start, time_end, first_temp_lpn, first_WL, first_temp_lpn_validated, second_temp_lpn, second_WL, second_temp_lpn_validated, third_temp_lpn, third_WL, third_temp_lpn_validated, first_temp_rising, first_WR, first_temp_rising_validated, second_temp_rising, second_WR, second_temp_rising_validated, third_temp_rising, third_WR, third_temp_rising_validated )

Upvotes: 1

Answers (2)

Shahzad

Reputation: 2063

The final solution in @Jose Rui Santos's answer will do exactly what you need.

DELETE
FROM table_name
WHERE id NOT IN (
    SELECT max(id)
    FROM table_name
    GROUP BY email
)

It will select all the IDs of the latest records for each group (the grouping being done on email in this case, but it can be on all of your columns as well), and then delete all the records that are not in the result of the query.

You can make sure it is, in fact, going to delete the right results by first just running the select:

SELECT max(id)
FROM table_name
GROUP BY email

This will show you everything that won't be deleted.

Upvotes: 0

Bill Karwin

Reputation: 562250

I'd do it this way:

mysql> select * from test;
+----+------------------+------+------+------+
| id | email            | col2 | col3 | col4 |
+----+------------------+------+------+------+
|  1 | [email protected] |    2 |    3 |    4 |
|  2 | [email protected] |    2 |    3 |    4 |
|  3 | [email protected] |    2 |    3 |    4 |
|  4 | [email protected] |    2 |    3 |    4 |
+----+------------------+------+------+------+

mysql> DELETE t1 FROM test AS t1 INNER JOIN test AS t2 
    ON t1.email=t2.email AND t1.col2=t2.col2 AND t1.col3=t2.col3 AND t1.col4=t2.col4 
    AND t1.id<t2.id;

mysql> select * from test;
+----+------------------+------+------+------+
| id | email            | col2 | col3 | col4 |
+----+------------------+------+------+------+
|  4 | [email protected] |    2 |    3 |    4 |
+----+------------------+------+------+------+

Explanation: The condition in the join looks for a row t2 such that the id is greater, and all other columns are equal. If such a row is found, then t1 must not be the row with the greatest id among the duplicates, therefore row t1 should be deleted.

Upvotes: 2

Delete duplicate rows in MySQL where all columns match. Keep latest row

Answers (2)

Related Questions