effe
effe

Reputation: 1365

Find duplicate rows in multiple columns

Here is my sql data

id      location1    location2      distance
--------------------------------------------
1       Paris        Marseille      150km
2       Paris        Lyon           200km
3       Paris        Strasbourg     300km
4       Paris        Toulouse       350km
5       Marseille    Paris          150km  <-(almost) duplicate of row 1
6       Marseille    Lyon           250km
...

Because the distance between Paris -> Marseille equals to Marseille -> Paris I want to remove one of the duplicated rows.

Table contains almost 1M rows, and half of them duplicates.How am I able to remove this duplicates for data on such a large table.

Upvotes: 2

Views: 135

Answers (4)

M. Page
M. Page

Reputation: 2814

This is a situation where you can join the the table with itself:

DELETE FROM city WHERE id IN (
  SELECT id
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id)

I assumed your table named city

As noted by miszyman, it is more efficient to avoid a subquery:

  DELETE c1
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269753

If half (or nearly half) are duplicates, I would go with creating a temporary table and re-inserting the data:

create temporary table tempt as
    select location1, location2, distance
    from mydata t
    where location1 < location2
    union all
    select location1, location2, distance
    from mydata t
    where not exists (select 1 from table t2 where t2.location1 = t1.location2 and t2.location2 = t1.location1);

truncate table mydata;

insert into mydata(location1, location2, distance)
    select location1, location2, distance
    from tempt;

For performance, you want an index on mydata(location1, location2):

create index idx_mydata_location1_location2 on mydata(location1, location2)

Upvotes: 0

Strawberry
Strawberry

Reputation: 33945

DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(id      INT NOT NULL
,location1    varchar(20) not null
,location2      varchar(20) not null
,distance INT NOT NULL
,UNIQUE(location1,location2)
);

INSERT INTO my_table VALUES
(1 ,'Paris','Marseille',150),
(2 ,'Paris','Lyon',200),
(3 ,'Paris','Strasbourg',300),
(4 ,'Paris','Toulouse',350),
(5 ,'Marseille','Paris',150),
(6 ,'Marseille','Lyon',250);

DELETE x 
  FROM my_table x 
  JOIN my_table y 
    ON y.location2 = x.location1 
   AND y.location1 = x.location2 
   AND y.distance = x.distance 
   AND y.id < x.id;
Query OK, 1 row affected (0.00 sec)

SELECT * 
  FROM my_table;
+----+-----------+------------+----------+
| id | location1 | location2  | distance |
+----+-----------+------------+----------+
|  6 | Marseille | Lyon       |      250 |
|  2 | Paris     | Lyon       |      200 |
|  1 | Paris     | Marseille  |      150 |
|  3 | Paris     | Strasbourg |      300 |
|  4 | Paris     | Toulouse   |      350 |
+----+-----------+------------+----------+

Upvotes: 1

hellcode
hellcode

Reputation: 2698

If all distances are twice in your database you could achieve it easily, if you just select the ones where location1 < location2

Upvotes: 1

Related Questions