Reputation: 1365
Here is my sql data
id location1 location2 distance
--------------------------------------------
1 Paris Marseille 150km
2 Paris Lyon 200km
3 Paris Strasbourg 300km
4 Paris Toulouse 350km
5 Marseille Paris 150km <-(almost) duplicate of row 1
6 Marseille Lyon 250km
...
Because the distance between Paris -> Marseille equals to Marseille -> Paris I want to remove one of the duplicated rows.
Table contains almost 1M rows, and half of them duplicates.How am I able to remove this duplicates for data on such a large table.
Upvotes: 2
Views: 135
Reputation: 2814
This is a situation where you can join the the table with itself:
DELETE FROM city WHERE id IN (
SELECT id
FROM city c1, city c2
WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
AND c1.id < c2.id)
I assumed your table named city
As noted by miszyman, it is more efficient to avoid a subquery:
DELETE c1
FROM city c1, city c2
WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
AND c1.id < c2.id
Upvotes: 2
Reputation: 1269753
If half (or nearly half) are duplicates, I would go with creating a temporary table and re-inserting the data:
create temporary table tempt as
select location1, location2, distance
from mydata t
where location1 < location2
union all
select location1, location2, distance
from mydata t
where not exists (select 1 from table t2 where t2.location1 = t1.location2 and t2.location2 = t1.location1);
truncate table mydata;
insert into mydata(location1, location2, distance)
select location1, location2, distance
from tempt;
For performance, you want an index on mydata(location1, location2)
:
create index idx_mydata_location1_location2 on mydata(location1, location2)
Upvotes: 0
Reputation: 33945
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(id INT NOT NULL
,location1 varchar(20) not null
,location2 varchar(20) not null
,distance INT NOT NULL
,UNIQUE(location1,location2)
);
INSERT INTO my_table VALUES
(1 ,'Paris','Marseille',150),
(2 ,'Paris','Lyon',200),
(3 ,'Paris','Strasbourg',300),
(4 ,'Paris','Toulouse',350),
(5 ,'Marseille','Paris',150),
(6 ,'Marseille','Lyon',250);
DELETE x
FROM my_table x
JOIN my_table y
ON y.location2 = x.location1
AND y.location1 = x.location2
AND y.distance = x.distance
AND y.id < x.id;
Query OK, 1 row affected (0.00 sec)
SELECT *
FROM my_table;
+----+-----------+------------+----------+
| id | location1 | location2 | distance |
+----+-----------+------------+----------+
| 6 | Marseille | Lyon | 250 |
| 2 | Paris | Lyon | 200 |
| 1 | Paris | Marseille | 150 |
| 3 | Paris | Strasbourg | 300 |
| 4 | Paris | Toulouse | 350 |
+----+-----------+------------+----------+
Upvotes: 1
Reputation: 2698
If all distances are twice in your database you could achieve it easily, if you just select the ones where location1 < location2
Upvotes: 1