Reputation: 21
I am have a table named 'LINK_INFO' with URLs in a field called 'URL'. The problem is, many duplicates URLs exist EXCEPT some have used a trailing / to get around the unique field requirement.
Example:
http://www.yahoo.com and http://www.yahoo.com/
What is the statement I can use to select these cases of near duplicates, so I can delete one of them? Many thanks if you can help.
Upvotes: 0
Views: 376
Reputation: 180937
You can just use TRIM
to find all unique values;
SELECT DISTINCT TRIM(TRAILING '/' FROM url) url
FROM link_info
To delete the duplicates right away, just do a delete join;
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE TRIM(TRAILING '/' FROM li1.url) =
TRIM(TRAILING '/' FROM li2.url)
AND li1.id<li2.id
Another SQLfiddle for testing.
Always back up your tables before running arbitrary SQL found on the net, even mine :)
EDIT: If your database machine is limited, you may want to do it using indexes and avoid loading more into memory than necessary;
-- remove all trailing slashes
UPDATE link_info
SET url=TRIM(TRAILING '/' FROM url);
-- create an index on the resulting strings (if there isn't already one)
CREATE INDEX url_index ON link_info(url);
-- delete all duplicates
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE li1.url = li2.url
AND li1.id<li2.id;
-- drop the index if not needed anymore
DROP INDEX url_index ON link_info;
Yet another SQLfiddle.
Upvotes: 1
Reputation: 34054
Maybe something like this?
CREATE TABLE link_info (
url varchar(255));
INSERT INTO link_info VALUES
('http://www.yahoo.com/'),
('http://www.yahoo.com');
SELECT DISTINCT
CASE
WHEN SUBSTRING(url, LENGTH(url)) = '/'
THEN LEFT(url, LENGTH(url) -1)
ELSE url
END AS `url`
FROM link_info
Result
| URL | ------------------------ | http://www.yahoo.com |
At this point you can insert this into a temp table, delete your parent and re-fill.
Upvotes: 0