Jake
Jake

Reputation: 21

mySQL statement to find nearly duplicate data with trailing slash

I am have a table named 'LINK_INFO' with URLs in a field called 'URL'. The problem is, many duplicates URLs exist EXCEPT some have used a trailing / to get around the unique field requirement.

Example:

http://www.yahoo.com and http://www.yahoo.com/

What is the statement I can use to select these cases of near duplicates, so I can delete one of them? Many thanks if you can help.

Upvotes: 0

Views: 376

Answers (2)

Joachim Isaksson
Joachim Isaksson

Reputation: 180937

You can just use TRIM to find all unique values;

SELECT DISTINCT TRIM(TRAILING '/' FROM url) url
FROM link_info

SQLfiddle to test with.

To delete the duplicates right away, just do a delete join;

DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE TRIM(TRAILING '/' FROM li1.url) =
      TRIM(TRAILING '/' FROM li2.url)
AND li1.id<li2.id

Another SQLfiddle for testing.

Always back up your tables before running arbitrary SQL found on the net, even mine :)

EDIT: If your database machine is limited, you may want to do it using indexes and avoid loading more into memory than necessary;

-- remove all trailing slashes
UPDATE link_info 
SET url=TRIM(TRAILING '/' FROM url);

-- create an index on the resulting strings (if there isn't already one)    
CREATE INDEX url_index ON link_info(url);

-- delete all duplicates
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE li1.url = li2.url
AND li1.id<li2.id;

-- drop the index if not needed anymore
DROP INDEX url_index ON link_info;

Yet another SQLfiddle.

Upvotes: 1

Kermit
Kermit

Reputation: 34054

Maybe something like this?

CREATE TABLE link_info (
  url varchar(255));

INSERT INTO link_info VALUES
('http://www.yahoo.com/'),
('http://www.yahoo.com');

SELECT DISTINCT
  CASE
    WHEN SUBSTRING(url, LENGTH(url)) = '/' 
      THEN LEFT(url, LENGTH(url) -1)
    ELSE url
  END AS `url`
FROM link_info

Result

|                  URL |
------------------------
| http://www.yahoo.com |

See a demo

At this point you can insert this into a temp table, delete your parent and re-fill.

Upvotes: 0

Related Questions